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Chapter 7 

Randomness and Statistics 



In this chapter, we are concerned with statistics in a very broad sense. This 
involves generation of (pseudo) random data, display /plotting of data and the 
statistical analysis of simulation results. 

Frequently, a simulation involves the explicit generation of random numbers, 
for instance, as auxiliary quantity for stochastic simulations. In this case it is 
obvious that the simulation results arc random as well. Although there are 
many simulations which are explicitly not random, the resulting behavior of 
the simulated systems may appear also random, for example the motion of 
interacting gas atoms in a container. Hence, methods from statistical data 
analysis are necessary for almost all analysis of simulation results. 

This chapter starts (Sec. 7.1) by an introduction to randomness and statis- 
tics. In Sec. 7.2 the generation of pseudo random numbers according to some 
given probability distribution is explained. Basic analysis of data, i.e., the cal- 
culation of mean, variance, histograms and corresponding error bars, is covered 
in Sec. 7.3. Next, in Sec. 7.4, it is shown how data can be represented graphi- 
cally using suitable plotting tools, gnuplot and xmgrace. Hypothesis testing and 
how to measure or ensure independence of data is treated in Sec. 7.5. How to 
fit data to functions is explained in Sec. 7.6. In the concluding section, a special 
technique is outlined which allows to cope with the limitations of simulations 
due to finite system sizes. 

Note that some examples are again presented using the C programming 
language. Nevertheless, there exist very powerful freely available programs like 
R [R], where many analysis (and plotting) tools are available as additional 
packages. 

7.1 Introduction to probability 

Here, a short introduction to concepts of probability and randomness is given. 
The presentation here should be concise concerning the subjects presented in this 
book. Nevertheless, more details, in particular proofs, examples and exercises, 
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can be found in standard textbooks [Dekking et al (2005), Lefcbvre (2006)]. 
Here often a sloppy mathematical notation is used for brevity, e.g. instead of 
writing "a function g : X ^ Y, y = g{x)" , we often write simply "a function 

A random experiment is an experiment which is truly random (like radioac- 
tive decay or quantum mechanical processes) or at least unpredictable (like 
tossing a coin or predicting the position of a certain gas atom inside a container 
which holds a hot dense gas). 

Definition The sample space O is a set of all possible outcomes of a random 
experiment. 

For the coin example, the sample space is = {head, tail}. Note that a 
sample space can be in principle infinite, like the possible x positions of an atom 
in a container. With infinite precision of measurement we have rj(=^) = [0,L^], 
where the container shall be a box with linear extents {Ly,Lz in the other 
directions, see below). 

For a random experiment, one wants to know the probability that certain 
events occur. Note that for the position of an atom in a box, the probability to 
find the atom precisely at some a;-coordinate x G fi^^^ is zero if one assumes that 
measurements result in real numbers with infinite precision. For this reason, one 
considers probabilities P{A) of subsets A C (in other words A g2^^, 2^^ being 
the power set which is the set of all subsets of O). Such a subset is called 
an event. Therefore P{A) is the probability that the outcome of a random 
experiment is inside A, i.e. one of the elements of A. More formally: 

Definition A probability function P is a function P : 2^ — > [0, 1] with 

p{n) = 1 (7.1) 

and for each finite or infinite sequence Ai, A2, A^, . . . of mutual disjoint events 

{Ai D Aj =0 for i ^ j) we have 

P(Ai U ^2 U ^3 U . . .) = P{Ai) + P{A2) + PiAs) + ... (7.2) 



For a fair coin, both sides would appear with the same probability, hence one 
has P(0) = 0, P({head}) = 0.5, P({tail}) = 0.5, P({head, tail}) = 1. For the 
hot gas inside the container, we assume that no external forces act on the atoms. 
Then the atoms are distributed uniformly. Thus, when measuring the x position 
of an atom, the probability to find it inside the region A=[x,x + Ax] C n(^) 
is P{A) = Ax/L^. 

The usual set operations applies to events. The intersection An B of two 
events is the event which contains elements that are both in A and B. Hence 
P{A n B) is the probability that the outcome of an experiment is contained in 
both events A and B. The complement A^ of a set is the set of all elements of 
fl which are not in A. Since A'^, A are disjoint and Au A'^ = fl, we get from 
Eq. (7.2): 

P(A'=) = 1 - P{A) . (7.3) 
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Furthermore, one can show for two events A,BcO,: 

P{A UB)= P{A) + P{B) - P{A n B) (7.4) 

Proof P{A) = P{A n n)^ P{A n (B u B"))^ P{{A n B) u n s")) p^a n 

B) + P{A n B"). If we apply this for A U B instead of A, we get P{A U B) = 
P{iAuB)nB) + P{{AuB)nB'')) = P{B) + P{AnB''). Eliminating P(AnS=) from 
these two equations gives the desired result. □ 
Note that Eqs. (7.2) and (7.3) are special cases of this equation. 

If a random experiment is repeated several times, the possible outcomes of 
the repeated experiment are tuples of outcomes of single experiments. Thus, 
if you throw the coin twice, the possible outcomes are (head, head), (head, tail), 
(tail, head), and (tail, tail). This means the sample space is a power of the single- 
experiment sample spaces. In general, it is also possible to combine different 
random experiments into one. Hence, for the general case, if k experiments 
with sample spaces . . . , fi^*^) are considered, the sample space of the 

combined experiment is = fi^^-' x fi^^^ x ... x fl^'^y For example, one can 
describe the measurement of the position of the atom in the hot gas as a com- 
bination of the three independent random experiments of measuring the x, y, 
and z coordinates, respectively. 

If we assume that the different experiments are performed independently, 
then the total probability of an event for a combined random experiment is 
the product of the single-experiment probabilities: P{A''^\ A'-^\ . . . , A'-'^^) = 
P(A(i))P(A(2))...P(AW). 

For tossing the fair coin twice, the probability of the outcome (head,tail) 
is P({(head,head)}) = P({head})P({head}) = 0.5 • 0.5 = 0.25. Similarly, for 
the experiment where all three coordinates of an atom inside the container 
are measured, one can write P{[x,x + Ax] x [y,y + Ay] x [z,z + Az]) = 
P{[x,x + Ax])Pi[y,y + Ay])P{[z,z + Az]) = {Ax/L^){Ay/Ly){Az/L,) = 
AxAyAz/{L^LyL^). 

Often one wants to calculate probabilities which are restricted to special 
events C among all events, hence relative or conditioned to C. For any other 
event A we have P(C)= P{{A\J A'')f^C)= P(An C) + P(A= n C), which means 
^')^(C)^ + ^^pi^l]^^ = f • Since P{A n C) is the probability of an outcome in A 
and C and because P{C) is the probability of an outcome in C, the fraction 
^^pIc) ^ gives the probability of an outcome A and C relative to C, i.e. the 
probability of event A given C, leading to the following 

Definition The probability of A under the condition C is 

^^^1^) = ^^gf^ • (7-5) 

As we have seen, we have the natural normalization P{A\C) + P{A'^\C) = 1. 
Rewriting Eq. (7.5) one obtains P{A\C)P{C) = P{A n C). Therefore, the 
calculation of P{Ar\C) can be decomposed into two parts, which are sometimes 
easier to obtain. By symmetry, we can also write P(C|^)P(A) = P{A fl C). 
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Combining this with Eq. (7.5), one obtains the famous Bayes' rule 



P{C\A) 



P{A\C)P{C) 
P{A) 



(7.6) 



This means one of the conditional probabihties P(A\C) and P{C\A) can be 
expressed via the other, which is sometimes useful if P{A) and P{C) are known. 
Note that the denominator in the Bayes' rule is sometimes written as P{A) = 

P{A n (C u 0"))= P{A n c) + P{A n c) = p{A\c)P{C) + p(A|C"=)P(C"=). 

If an event A is independent of the condition C, its conditional probability 
should be the same as the unconditional probability, i.e., P{A\C) — P{A). Using 
PiA n C) = P{A\C)PiC) we get P(A n C) = P(A)P(C), i.e., the probabilities 
of independent events have to be multiplied. This was used already above for 
random experiments, which are conducted as independent subexperiments. 

So far, the outcomes of the random experiments can be anything like the 
sides of coins, sides of a dice, colors of the eyes of randomly chosen people or 
states of random systems. In mathematics, it is often easier to handle numbers 
instead of arbitrary objects. For this reason one can represent the outcomes of 
random experiments by numbers which are assigned via special fTinctions: 

Definition For a sample space fl, a random variable is a function X : f2 — > 
R. For example, one could use X(head)=l and X(tail) = 0. Hence, if one 
repeats the experiments k times independently, one would obtain the number 
of heads by X^iLi ^('-^'*'')7 where w^*-* is the outcome of the i'th experiment. 

If one is interested only in the values of the random variable, the connection 
to the original sample space Q is not important anymore. Consequently, one 
can consider random variables X as devices, which output a random number 
X each time a random experiment is performed. Note that random variables 
are usually denoted by upper-case letters, while the actual outcomes of random 
experiments are denoted by lower-case letters. 

Using the concept of random variables, one deals only with numbers as 
outcomes of random experiments. This enables many tools from mathematics 
to be applied. In particular, one can combine random variables and functions to 
obtain new random variables. This means, in the simplest case, the following: 
First, one performs a random experiment, yielding a random outcome x. Next, 
for a given function g, y — g{x) is calculated. Then, y is the final outcome 
of the random experiment. This is called a transformation Y = g{X) of the 
random variable X. More generally, one can also define a random variable Y by 
combining several random variables X^^\ X^^\ .... X^'^^ via a function g such 



In practice, one would perform random experiments for the random variables 
X^''\ resulting in outcomes x^^\x^^\ x^''\ The final 
number is obtained by calculating y = g{x'^^\x'^^\ . . . ,x^''^). A simple but 
the most important case is the linear combination of random variables Y = 
aiX^^^ + a2X^'^^+ . . .+akX'-'^\ which will be used below. For all examples 



that 




(7.7) 
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considered here, the random variables X'-^\ X'-^K^ . . . , X^'''> have the same prop- 
erties, which means that the same random experiment is repeated k times. 
Nevertheless, the most general description which allows for different random 
variables will be used here. 

The behavior of a random variable is fully described by the probabilities of 
obtaining outcomes smaller or equal to a given parameter x: 

Definition The distribution function of a random variable X is a function 
Fx : M — ^ [0, 1] defined via 

Fx{x) = P{X<x) (7.8) 



The index X is omitted if no confusion arises. Sometimes the distribution 

function is also named cumulative distribution function. One also says, the dis- 
tribution function defines a probability distribution. Stating a random variable 
or stating the distribution function are fully equivalent methods to describe a 
random experiment. 

For the fair coin, we have, see left of Fig. 7.1 



F{x) = < 



X <0 
0.5 < a; < 1 

1 a; > 1 



(7.9) 



For measuring the x position of an atom in the uniformly distributed gas we 
obtain, see right of Fig. 7.1 



F{x) 





x/Lx 
1 



X <0 
<x < 
x> Lj. 



(7.10) 



F(x)| 
1 



0.5 



1 X X 

Figure 7.1: Distribution function of the random variable for a fair coin (left) 
and for the random x position of a gas atom inside a container of length Lx. 

Since the outcomes of any random variable are finite, there are no possible 
outcomes X < x in the limit x — > — oo. Also, all possible outcomes fulfill X < x 
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for X oo. Consequently, one obtains for all random variables limj;^_oo F{^) = 
and lima;_>_|_oo -^(a;) — 1. Furthermore, from Dcf. 7.1, one obtains immediately: 

P{xo < X < xi) = Fxixi) - Fx{xo) (7.11) 

Therefore, one can calculate the probability to obtain a random number for any 
arbitrary interval, hence also for unions of intervals. 

The distribution function, although it contains all information, is sometimes 
less convenient to handle, because it gives information about cumulative proba- 
bilities. It is more obvious to describe the outcomes of the random experiments 
directly. For this purpose, we have to distinguish between discrete random vari- 
ables, where the number of possible outcomes is denumerable or even finite, and 
continuous random variables, where the possible outcomes are non-denumerable. 
The random variable describing the coin is discrete, while the position of an 
atom inside a container is continuous. 

7.1.1 Discrete random variables 

We first concentrate on discrete random variables. Here, an alternative but 
equivalent description to the distribution function is to state the probability for 
each possible outcome directly: 

Definition For a discrete random variable X, the probability mass function 
(pmf) px - [0, 1] is given by 

px{x) = P{X = x) . (7.12) 

Again, the index X is omitted if no confusion arises. Since a discrete random 
variable describes only a denumerable number of outcomes, the probability mass 
function is zero almost everywhere. In the following, the outcomes x where 
Px{x) > are denoted by Xi. Since probabilities must sum up to one, see 
Eq. 7.1, one obtains J2iPx{xi) = 1. Sometimes we also write pi = pxixi). 
The distribution funcion Fx(x) is obtained from the pmf via summing up all 
probabilities of outcomes smaller or equal to x: 

Fx{x) = Px{xi) (7.13) 

Xi<X 

For example, the pmf of the random variable arising from the fair coin Eq. 
(7.9) is given by p{Q) = 0.5 and p{l) = 0.5 {p{x) = elsewhere). The general- 
ization to a possibly unfair coin, where the outcome "1" arises with probability 
p, leads to: 

Definition The Bernoulli distribution with parameter p (0 < p < 1) de- 
scribes a discrete random variable X with the following probability mass func- 
tion 

px{l)=p, px{0) = l-p. (7.14) 

Performing a Bernoulli experiment means that one throws a generalized coin 
and records either "0" or "1" depending on whether one gets head or tail. 
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There arc a couple of important characteristic quantities describing the pmf 
of a random variable. Next, we describe the most important ones for the discrete 
case: 

Definition 

• The expectation value is 

H = E[X] =Y,XiP{X = Xi) =^xipx{xi) (7.15) 

i i 

• The variance is 

= Var[X] = E[{X - E[X]f] = ^(5, - E[X]) (7.16) 



• The standard deviation 

a = VVar[X] (7.17) 

The expectation value describes the "average" one would typically obtain if the 
random experiment is repeated very often. The variance is a measiirc for the 
spread of the different outcomes of random variable. As example, the Bernoulli 
distribution exhibits 

E[X] = Op{0) + lp{l) ^ p (7.18) 
Var[X] = {0 - pfp{0) + {1 - pfp{l) 

= p^il-p) + il~pfp^p{l-p) (7.19) 

One can calculate expectation values of functions g{x) of random variables X 
via E[(7(X)] = ^j_g{xi)px{xi). For the calculation here, we only need that the 
calculation of the expectation value is a linear operation. Hence, for numbers 
ai,Q!2 and, in general, two random variables Xi,X2 one has 

E[aiXi + aaXa] = ai E[Xi] + as £[^2] . (7.20) 
In this way, realizing that E[X] is a number, one obtains: 

£7^ = Var(X) = E[{X - E[X]f] = EiX"^] - 2 E[X E[X]] + E[E[X]2] 

= E[X^]-E[Xf = E[X'^]- fi'^ (7.21) 
<^ E[X2] = CT^+zi^ (7.22) 

The variance is not linear, which can be seen when looking at a linear com- 
bination of two independent random variables Xi,X2 (implying E[XiX2] = 
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E[Xi]E[X2] W) 



0!lXi+a2-^2 



(7.20), W 



(7.21) 



(7.21) 



(7.20) 



Var[Q!iXi + a2^2] 

E[(aiX2 + aaXa)^] - E[aiXi + 03X2]^ 

E[af + 20102X1X2 + aiX|] 
-(aiE[Xi]+a2E[X2])2 

al E[X2] + al E[X|] - aj E[Xi]2 + aj E[X2]' 

a? Var[Xi] + al Var[X2] (' 



(7.23) 



The expectation values are called the n'th moments of the distribu- 

tion. This means that the expectation value is the first moment and the variance 
can be calculated from the first and second moments. 

Next, wc describe two more important distributions of discrete random vari- 
ables. First, if one repeats a Bernoulli experiment n times, one can measure 
how often the result "1" was obtained. Formally, this can be written as a sum of 
n random variables X^*^ which are BernouUi distributed: X = X^Hi ^'^'^ with 
parameter p. This is a very simple example of a transformation of a random vari- 
able, see page 6. In particular, the transformation is linear. The probability to 
obtain x times the result "1" is calculated as follows: The probability to obtain 
exactly x times a "1" is , the other n — x experiments yield "0" which hap- 
pens with probability (1 — p)""^. Furthermore, there are (") = n\/{xl{n — x)!) 
different sequences with x times "1" and n — x times "0" . Hence, one obtains: 

Definition The binomial distribution with parameters n G N and p (0 < 
p <1) describes a random variable X which has the pmf 



A common notation is X ^ B{n,p). 

Note that the probability mass function is assumed to be zero for argument 
values that are not stated. A sample plot of the distribution for parameters 
n = 10 and p = 0.4 is shown in the left of Fig. 7.2. The Binomial distribution 
has expectation value and variance 



(without proof here). The distribution function cannot be calculated analyti- 
cally in closed form. 

In the limit of a large number of experiments (n 00), constrained such that 
the expectation value n = up is kept fixed, the pmf of a Binomial distribution 
is well approximated by the pmf of the Poisson distribution, which is defined as 
follows: Definition The Poisson distribution with parameter /x > describes a 
random variable X with pmf 




(0 < a; < n) 



(7.24) 



E[X] = np 
Var[X] = npil 



P) 



(7.25) 
(7.26) 



Px{x) 




(7.27) 
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4 6 

X 




10 2 



Figure 7.2: (Loft) Probability mass function of the binomial distribution for 
parameters n = 10 and p = 0.4. (Right) Probability mass function of the 
geometric distribution for parameter p = 0.4. 



Indeed, as required, the probabilities sum up to 1, since ^ is the Taylor 
series of e'*. The Poisson distribution exhibits E[X] = /x and Var[X] = /x. Again, 
a closed form for the distribution function is not known. 

Furthermore, one could repeat a Bernoulli experiment just until the first time 
a "1" is observed, without limit for the number of trials. If a "1" is observed for 
the first time after exactly x times, then the first x—1 times the outcome "0" was 
observed. This happens with probability (1 — p)^~^. At the x'th experiment, 
the outcome "1" is observed which has the probability p. Therefore one obtains 

Definition The geometric distribution with parameter p (0 < p < 1) de- 
scribes a random variable X which has the pmf 

Px{x) = {1-pY-^P {x&n) (7.28) 

A sample plot of the pmf (up to x = 10) is shown in the right of Fig. 7.2. The 
geometric distribution has (without proof here) the expectation value E[X] = 
1/p, the variance Var[X] = (1 — p)/p^ and the following distribution function: 



Fx{x) 



a; < 1 



l-(l-p)'" m<x<m + l (meN) 



7.1.2 Continuous random variables 

As stated above, random variables are called continuous if they describe random 
experiments where outcomes from a subset of the real numbers can be obtained. 
One may describe such random variables also using the distribution function, 
see Def. 7.1. For continuous random variables, an alternative description is 
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possible, equivalent to the pmf for discrete random variables: The probability 
density function states the probability to obtain a certain number per unit: 

Definition For a continuous random variable X with a continuous distribu- 
tion function Fx, the probability density function (pdf) px [0, 1] is given 

Consequently, one obtains, using the definition of a derivative and using Eq. 
(7.11) 

Fx{x) = f dxpx{x) (7.30) 

•/ — OO 

P{xo <X <xi) = / dxpx{x) (7.31) 

J Xo 

Below some examples for important continuous random variables are pre- 
sented. First, we extend the definitions Def. 7.1.2 of expectation value and 
variance to the continuous case: 

Definition 

• The expectation value is 

/oo 
dxxpx{x) (7.32) 
-OO 

• The variance is 

r — oo 

Var[X] = E[{X - E[X]f] = {x - E[X]fpx{x) (7.33) 

J CO 

Expectation value and variance have the same properties as for the discrete 
case, i.e., Eqs. (7.20), (7.21), and (7.23) hold as well. Also the definition of the 
n'th moment of a continuous distribution is the same. 

Another quantity of interest is the median, which describes the central point 
of the distribution. It is given by the point such that the cumulative probabilities 
left and right of this point are both equal to 0.5: Definition The median 
Xmed = Med[X] is defined via 

F{x^,i) = 0.5 (7.34) 



The simplest distribution is the uniform distribution, where the probability 
density function is nonzero and constant in some interval [a, 5): Definition The 
uniform distribution, with real-valued parameters a < b, describes a random 
variable X which has the pdf 



Px{x) = < 



X < a 
X < X <b 
x>0 



(7.35) 
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One writes X ^ U{a, b). The distribution function simply rises linearly from 
zero, starting at x = a, till it reaches 1 at a; = 6, see for example Eq. 7.10 for 
the case a = and b = L^. The uniform distribution exhibits the expectation 
value E[X] = (a + b)/2 and variance Va.r[X] = (b - a)'^/12. Note that via 
the linear transformation g{X) = {b — a) * X + a one obtains g{X) ~ U{a,b) 
if X ^ U{0, 1). The uniform distribution serves as a basis for the generation 
of (pseudo) random numbers in a computer, see Sec. 7.2.1. All distributions 
can be in some way obtained via transformations from one or several uniform 
distributions, see Sees. 7.2.2-7.2.5. 

Probably the most important continuous distribution in the context of sim- 
ulations is the Gaussian distribution: 

Definition The Gaussian distribution, also called normal distribution, with 
real- valued parameters and a > 0, describes a random variable X which has 
the pdf 

One writes X ~ N{^,a'^). The Gaussian distribution has expectation value 
E[X] = fi and variance Var[X] = a^. A sample plot of the distribution for 
parameters = 5 and cr = 3 is shown in the left of Fig. 7.3. The Gaussian 
distribution for /x = and u = 1 is called standard normal distribution N{0, 1). 
One can obtain any Gaussian distribution from Xq ~ A''(0, 1) by applying the 
transformation g{Xo) = aXo + ji. Note that the distribution function for the 
Gaussian distribution cannot be calculated analytically. Thus, one uses usually 
numerical integration or tabulated values of A^(0, 1) 



0.15^ • \ • \ • \ • 0.4 




Figure 7.3: (Left) Probability density function of the Gaussian distribution for 
parameters /x = 5 and (7 = 3. (Right) Probability density function of the 
exponential distribution for parameter = 3. 

The central limit theorem describes how the Gaussian distribution arises 
from a sum of random variables: 



14 



A.K. Hartmann: Practical Guide to Computer Simulations 



Theorem Let X^^\ X'-^\ . . . , X*^"^ be independent random variables, which 
follow all the same distribution exhibiting expectation value /x and variance cr^. 
Then 

n 

X = ^XW (7.37) 

i=l 

is in the limit of large n approximately Gaussian distributed with mean n/x and 
variance na^, i.e. X N{nfi,n(7^). 

Equivalently, the suitably normalized sum 

Z = " ^'=\\- ^ (7.38) 

is approximately standard normal distributed Z ~ A''(0, 1). For a proof, please 
refer to standard text books on probability. Since sums of random processes arise 
very often in nature, the Gaussian distribution is ubiquitous. For instance, the 
movement of a "large" particle swimming in a liquid called Brownian motion is 
described by a Gaussian distribution. 

Another common probability distribution is the exponential distribution. 
Definition The exponential distribution, with real- valued parameter fj, > 0, 
describes a random variable X which has the pdf 

Px {x) = - exp {-x/ij) (7.39) 
A* 

A sample plot of the distribution for parameter /i = 3 is shown in the right 
of Fig. 7.3. The exponential distribution has expectation value E[X] = n and 
variance Var[X] = fj?. The distribution function can be obtained analytically 
and is given by 

Fx(a;) = l-exp(-a;//x) (7.40) 

The exponential distribution arises under circumstances where processes 
happen with certain rates, i.e., with a constant probability per time unit. Very 
often, waiting queues or the decay of radioactive atoms are modeled by such 
random variables. Then the time duration till the first event (or between two 
events if the experiment is repeated several times) follows Eq. (7.39). 

Next, we discuss a distribution, which has attracted recently [Newman (2003), 
Newman et al. (2006)] much attention in various disciplines like sociology, physics 
and computer science. Its probability distribution is a power law: 

Definition The power-law distribution, also called Pareto distribution, with 
real-valued parameters 7 > and k > 0, describes a random variable X which 
has the pdf 

^^(^) = |l(./.)-+^ x>l ^'-''^ 

A sample power-law distribution is shown in Fig. 7.4. When plotting a power- 
law distribution with double-logarithmic scale, one sees just a straight line. 
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A discrctizcd version of the power-law distribution appears for example in 
empirical social networks. The probability that a person has x "close friends" 
follows a power-law distribution. The same is observed for computer networks 
for the probability that a computer is connected to x other computers. The 
power-law distribution has a finite expectation value only if 7 > 1, i.e. if it falls 

off quickly enough. In that case one obtains E[X] = 7«;/(7 — 1). Similarly, it 

2 

exhibits a finite variance only for 7 > 2: Var[X] = . The distribution 

function can be calculated analytically: 

Fx{x) = 1-{x/k)-^ {x>1) (7.42) 



^2 



10 

X 




Figure 7.4: (Left) Probability density function of the power-law distribution 
for parameters 7 = 8 and k = 1. (Right) Probability density function of the 
Fisher-Tippett distribution for parameter A = 3 with logarithmically scaled 
y-axis. 

In the context of extreme-value statistics, the Fisher-Tippett distribution 
(also called log-Weibull distribution) plays an important role. 

Definition The Fisher-Tippett distribution, with real- valued parameters 
A > 0,a;o, describes a random variable X which has the pdf 

px{x) = Ae-^^e"^""' (7.43) 

In the special case of A = 1, the Fisher-Tippett distribution is also called Gumhel 
distribution. A sample Fisher-Tippett distribution is shown in the right part 
of Fig. 7.4. The function exhibits a maximum at x = 0. This can be shifted to 
any value /i by replacing x by x — fi. The expectation value is E[X] — v jX, where 
V = 0.57721 ... is the Euler-Mascheroni constant. The distribution exhibits a 
variance of Var[X] = Also, the distribution function is known analytically: 



Fx{x)=e-' 



(7.44) 
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Mathematically, one can obtain a Gumbel (A = 1) distributed random 
variable from n standard normal N{0, 1) distributed variables X^*-* by tak- 
ing the maximum of them and performing the limit n — > oo, i.e. X = 
lim„_>oo max . . . , X*^"^ }. This is also true for some other "well- 

behaved" random variables like exponential distributed ones, if they are nor- 
malized such that they have zero mean and variance one. The Fisher-Tippett 
distribution can be obtained from the Gumbel distribution via a linear trans- 
formation. 

For the estimation of confidence intervals (see Sees. 7.3.2 and 7.3.3) one 

needs the chi-squared distribution and the F distribution, which are presented 
next for completeness. 

Definition The chi-squared distribution, with v > degrees of freedom de- 
scribes a random variable X which has the probability density function (using 
the Gamma function T{x) = t^~^e~* dt) 

and px{x) = for a; < 0. Distribution function, mean and variance are not 
stated here. A chi-squared distributed random variable can be obtained from 
a sum of V squared standard normal distributed random variables Xi: X = 
"Y^I-iX^. The chi-squared distribution is implemented in the GNU scientific 
library (see Sec. 6.3). 

Definition The F distribution, with di,d2 > degrees of freedom describes 
a random variable X which has the pdf 

vAx) /i/V^/^ r(rfi/2 + rf,/2) x'^^'^-^ 

px{x)-d, r(rfi/2)r(d2/2)(dix + d2)<^i/2+''2/2 ^^>"^ ^^-4^^ 

and px{x) = for a; < 0. Distribution function, mean and variance are 
not stated here. An F distributed random variable can be obtained from a 
chi-squared distributed random variable Yi with di degrees of freedom and 
a chi-squared distributed random variable Y2 with ^2 degrees of freedom via 
X = ^y^^- The F distribution is implemented in the GNU scientific library 
(see Sec. 6.3). 

Finally, note that also discrete random variables can be described using 
probability density functions if one applies the so-called delta function 5{x — xq). 
For the purpose of computer simulations this is not necessary. Consequently, 
no further details are presented here. 



7.2 Generating (pseudo) random numbers 

For many simulations in science, economy or social sciences, random numbers 
are necessary. Quite often the model itself exhibits random parameters which 
remain fixed throughout the simulation; one speaks of quenched disorder. A 
famous example in the field of condensed matter physics are spin glasses, which 
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are random alloys of magctic and non-magnetic materials. In this case, when 
one performs simulations of small systems, one has to perform an average over 
different disorder realizations to obtain physical quantities. Each realization of 
the disorder consists of randomly chosen positions of the magnetic and non- 
magnetic particles. To generate a disorder realization within the simulations, 
random numbers are required. 

But even when the simulated system is not inherently random, very of- 
ten random numbers are required by the algorithms, e.g., to realize a finite- 
temperature ensemble or when using randomized algorithms. In summary, the 
application of random numbers in computer simulations is ubiquitous. 

In this section an introduction to the generation of random numbers is given. 
First it is explained how they can be generated at all on a computer. Then, 
different methods are presented for obtaining numbers which obey a target dis- 
tribution: the inversion method, the rejection method and Box-MiiUer method. 
More comprehensive information about these and similar techniques can be 
found in Rcfs. [Morgan (1984), Devroye (1986), Press et al. (1995)]. In this sec- 
tion it is assumed that you are familiar with the basic concepts of probability 
theory and statistics, as presented in Sec. 7.1. 

7.2.1 Uniform (pseudo) random numbers 

First, it should be pointed out that standard computers are deterministic ma- 
chines. Thus, it is completely impossible to generate true random numbers 
directly. One could, for example, include interaction with the user. It is, for 
example, possible to measure the time interval between successive keystrokes, 
which is randomly distributed by nature. But the resulting time intervals de- 
pend heavily on the current user which means the statistical properties cannot 
be controlled. On the other hand, there arc external devices, which have a 
true random physical process built in and which can be attached to a computer 
[Qantis, Westphal] or used via the internet [Hotbits]. Nevertheless, since these 
numbers are really random, they do not allow to perform stochastic simulations 
in a controlled and reproducible way. This is important in a scientific context, 
because spectacular or unexpected results are often tried to be reproduced by 
other research groups. Also, some program bugs turn up only for certain ran- 
dom numbers. Hence, for debugging purposes it is important to be able to run 
exactly the same simulation again. Furthermore, for the true random numbers, 
either the speed of random number generation is limited if the true random 
numbers are cheap, or otherwise the generators are expensive. 

This is the reason why pseudo random numbers are usually taken. They are 
generated by deterministic rules. As basis serves a number generator function 
randO for a uniform distribution. Each time randO is called, a new (pseudo) 
random number is returned. (Now the "pseudo" is omitted for convenience) 
These random numbers should "look like" true random numbers and should 
have many of the properties of them. One says they should be "good" . What 
"look like" and "good" means, has to be specified: One would like to have 
a random number generator such that each possible number has indeed the 
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same probability of occurrence. Additionally, if two generated numbers r^.r^ 
differ only slightly, the random numbers ri+i,rk+i returned by the respective 
subsequent calls should differ sustancially, hence consecutive numbers should 
have a low correlation. There are many ways to specify a correlation, hence 
there is no unique criterion. Below, the simplest one will be discussed. 

The simplest methods to generate pseudo random numbers are linear con- 
gruential generators . They generate a sequence Xi,X2, ... of integer numbers 
between and m — 1 by a recursive rule: 

Xn+i = {axn + c)modm . (7.47) 

The initial value xq is called seed. Here we show a simple C implementation 
lin_con() . It stores the current number in the local variable x which is declared 
as static, such that it is remembered, even when the function is terminated 
(see Sec. 1.2). There are two arguments. The first 
argument set_seed indicates whether one wants 
to set a seed. If yes, the new seed should be passed 
as second argument, otherwise the value of the 
second argument is ignored. The function returns 
the seed if it is changed, or the new random number. Note that the constants a 
and c are defined inside the function, while the modulus M is implemented via 
a macro RNG_MODULUS to make it visible outside lin_con(): 

#define RNG.MODULUS 32768 /* modulus */ 

int lin.conCint set.seed, int seed) 
{ 

static int x = 1000; /* current random number */ 

const int a = 12351; /* multiplier */ 

const int c = 1; /* shift */ 



GET SOURCE CODE 

DIR: randomness 
FILE(S): rng.c 



if(set_seed) /* new seed ? */ 

X = seed; 

else /* new random number ? */ 

X = (a*x+c) 7. RNG.MODULUS; 

return (x) ; 

} 

If you just want to obtain the next random number, you do not care about 
the seed. Hence, we use for convenience rn_lin_con() to call lin_con() with 
the first argument being 0: 

int rand_lin_con() 
{ 

retiirn(lin_con(0,0)) ; 

} 
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If wc want to set the seed, we also use for convenience a special trivial 
function seed_lin_con() : 

void sraiid_lin_con(iiit seed) 
-C 

lin_con(l, seed); 

} 

To generate random numbers r distributed in the interval [0, 1) one has to 
divide the current random number by the modulus m. It is desirable to obtain 
equally distributed outcomes in the interval, i.e. a uniform distribution. Ran- 
dom numbers generated from this distribution can be used as input to generate 
random numbers distributed according to other, basically arbitrary, distribu- 
tions. Below, you will sec how random numbers obeying other distributions can 
be generated. The following simple C function generates random numbers in 
[0, 1) using the macro RNG_MODULUS defined above: 

double drand_lin_con() 

i 

returnC (double) lin_con(0,0) / RNG_MODULUS) ; 

} 

One has to choose the parameters a, c, m in a way that "good" random num- 
bers are obtained, where "good" means "with less correlations". Note that in 
the past several results from simulations have been proven to be wrong because 
of the application of bad random number generators [Ferrenberg et al. (1992), 
Vattulainen et al. (1994)]. 

Example To see what "bad generator" means, consider as an example the 
parameters a = 12351, c = l,m = 2^^ and the seed value /o — 1000. 10000 
random numbers are generated by dividing each of them by m. They are dis- 
tributed in the interval [0, 1). In Fig. 7.5 the distribution of the random numbers 
is shown. 

The distribution looks rather flat, but by taking a closer look some regular- 
ities can be observed. These regularities can be studied by recording fc-tuples 
of k successive random numbers {xi,Xi+i, . . . ,Xi+k-i)- A good random num- 
ber generator, exhibiting no correlations, would fill up the fc-dimcnsional space 
uniformly. Unfortunately, for linear congruential generators, instead the points 
lie on {k — l)-dimensional planes. It can be shown that there are at most of the 
order m^/'^ such planes. A bad generator has much fewer planes. This is the 
case for the example studied above, sec top part of Fig. 7.6 

The result for a = 123450 is even worse: only 15 different "random" munbers 
are generated (with seed 1000), then the iteration reaches a fixed point (not 
shown in a figure). 

If instead a = 12349 is chosen, the two-point correlations look like that 
shown in the bottom half of Fig. 7.6. Obviously, the behavior is much more 
irregular, but poor correlations may become visible for higher A;-tuples. 

A generator which has passed several empirical tests is a = 7^ = 16807, 
m = 2^^ — 1, c = 0. When implementing this generator you have to be careful, 
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Figure 7.5: Distribution of random numbers in the interval [0, 1) obtained 
from converting a histogram into a pdf, see Sec. 7.3.3. The random num- 
bers are generated using a hnear congruential generator with the parameters 
a= 12351, c= l,m = 2i^ 

because during the calculation numbers are generated which do not fit into 32 
bit. A clever implementation is presented in Ref. [Press et al. (1995)]. Finally, 
it should be stressed that this generator, like all linear congruential generators, 
has the low-order bits much less random than the high-order bits. For that 
reason, when you want to generate integer numbers in an interval [1,N], you 
should use 

r = l+(int) (N*x_n/m) ; 

instead of using the modulo operation as with r=l+(x_n 7, N). 

In standard C, there is a simple built-in random number generator called 
randO (see corresponding documentation), which has a modulus m = 2^^, 
which is very poor. On most operating systems, also draiid48() is available, 
which is based on m = 2'^^ {^a=,c= 11) and needs also special arithmetics. It is 
already sufficient for simulations which no not need many random numbers and 
do not required highest statistical quality. In recent years, several high-standard 
random number generators have been developed. Several very good ones are 
included in the freely availabe GNU scientific library (see Sec. 6.3). Hence, you 
do not have to implement them yourself. 

So far, it has been shown how random numbers can be generated which are 
distributed uniformly in the interval [0, 1). In general, one is interested in ob- 
taining random numbers which are distributed according to a given probability 
distribution with some density p{x). In the next sections, several techniques 
suitable for this task are presented. 




Figure 7.6: Two point correlations Xi+i{xi) between successive random numbers 
Xi, Xi+i. The top case is generated using a linear congruential generator with the 
parameters a = 12351, c = 1, m = 2^^, the bottom case has instead a = 12349. 



7.2.2 Discrete random variables 

In case of discrete distributions with finite number of possible outcomes, one can 
create a table of the possible outcomes together with their probabilities px{xi) 
{i = 1, . . . , imax), assuming that the Xi are sorted in ascending order. To draw 
a number, one has to draw a random number u which is uniformly distributed 
in [0, 1) and take the entry j of the table such that the sum sj = J2''i=iPxi^i) 
of the probabilities is larger than u, but sj-i = X]i=i Px{i) < u. Note that one 
can search the array quickly by bisection search: The array is iteratively divided 
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it into two halves and each time continued in that half where the corresponding 
entry j is contained. In this way, generating a random number has a time 
complexity which grows only logarithmically with the number imax of possible 
outcomes. This pays off if the number of possible outcomes is very large. 

In exercise (1) you are asked to write a function to sample from the proba- 
bility distribution of a discrete variable, in particular for a Poisson distribution. 

In the following, we concentrate on techniques for generating continuous 
random variables. 

7.2.3 Inversion Method 

Given is a random number generator drcindO which is assumed to generate 
random numbers U which are distributed uniformly in [0,1). The aim is to 
generate random numbers Z with probability density pz{z)- The corresponding 
distribution function is 

Fz{z) = P{Z < z) = I dz'pziz') (7.48) 

J — oo 

The target is to find a function .9(m), such that after the transformation 
Z = g{U) the outcomes of Z are distributed according to (7.48). It is assumed 
that g can be inverted and is strongly monotonically increasing. Then one 
obtains 

Fz{z) = P{Z < z) = P{g{U) <z) = P{U < g-\z)) (7.49) 

Since the distribution function Fjj{u) = P{U < u) for a uniformly distributed 
variable is just Fu{u) = u {u G [0,1]), one obtains Fz{z) = g~^{z). Thus, 
one just has to choose g{z) = F^^{z) for the transformation fimction in order 
to obtain random numbers, which are distributed according to the probability 
distribution Fz{z). Of course, this only works if Fz can be inverted. If this 
is not possible, you may use the methods presented in the subscqiient sections, 
or you could generate a table of the distribution function, which is in fact a 
discretized approximation of the distribution function, and use the methods for 
generating discrete random numbers as shown in Sec. 7.2.2. This can be even 
refined by using a linearized approximation of the distribution function. Here, 
we do not go into further details, but present an example where the distribution 
function can be indeed inverted. 

Example Let us consider the exponential distribution with parameter fj,, 
with distribution function Fz{z) = 1 — ex.p{—z/^), see page 14. Therefore, one 
can obtain exponentially distributed random numbers Z by generating uniform 
distributed random numbers u and choosing z = — /zln(l — u). 

The following simple C function generates 
a random number which is exponentially dis- 
tributed. The parameter n of the distribution is 
passed as argument. 



GET SOURCE CODE 

DIR: random 
FILE(S): expo.c 
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Figure 7.7: Histogram pdf (see page 35) of random numbers generated according 
to an exponential distribution = 1) compared with the probabihty density 
function (straiglit line) in a logarithmic plot. 



double rand_expo (double mu) 
{ 

double randnxm; /* random number U(0,1) */ 

randnum = drand48(); 

retiirn(-mu*log(l-randnum) ) ; 

} 

Note that wc use in line 4 the simple drand48() random number generator, 
which is included in the C standard library and works well for applications with 
moderate statistical requirements. For more sophisticated generates, see e.g. 
the GNU scientific library (see Sec. 6.3). 

In Fig. 7.7 a histogram pdf (see page 35) for 10^ random numbers generated 
in this way and the exponential probability function for /x = 1 are shown with a 
logarithmically scaled y-axis. Only for larger values are deviations visible. They 
are due to statistical fluctuations since pz{z) is very small there. 

7.2.4 Rejection Method 

As mentioned above, the inversion method works only when the distribution 
function P can be inverted analytically. For distributions not fulfilling this 
condition, sometimes this problem can be overcome by drawing several random 
numbers and combining them in a clever way. 

The rejection method works for random variables where the pdf p{x) fits 
into a box [xo,Xi) x [0,j/inax), i-e., p{x) = for a: ^ [xo,xi] and p{x) < j/max- 
The basic idea of generating a random number distributed according to p{x) is 
to generate random pairs {x,y), which are distributed uniformly in [a;o,a;i] x 
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[0,2/max] and accept only those numbers x where y < p{x) holds, i.e., the pairs 
which are located below p{x), see Fig. 7.8. Therefore, the probability that x is 
drawn is proportional to p{x), as desired. 

The following C function realizes the rejection 
method for an arbitrary pdf. It takes as argu- 
ments the boundaries of the box y_inax, xO and 
xl as well as a pointer pdf to the function realiz- 



GET SOURCE CODE 



DIR: randomness 
FILE(S): reject.c 



ing the pdf. For an explanation of function pointers, see Sec. 1.4. 



double reject (double y_max, double xO, 
double (* pdf ) (double)) 



double xl. 



int found; 

double x,y; 
found = 0; 



/* flag if valid nvunber has been foimd */ 
/* random points in [xO,xl]x[0,p_max] */ 



wliile( !f oimd) 
{ 

X = xO + (xl-x0)*drand48() ; 
y = y_max *drand48() ; 
if(y <= pdf (x)) 
foimd = 1; 

> 

return (x) ; 



/* loop imtil number is generated */ 

/* uniformly on [xO,xl] */ 
/* uniformly in [0,p_majx] */ 
/* accept ? */ 



In lines 9-10 the random point, which is uniformly distributed in the box, is 
generated. Lines 11-12 contain the check whether a point below the pdf curve 
has been found. The search in the loop (lines 7-13) continues until a random 
number has been accepted, which is returned in line 14. 
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Example The rejection method is appHed to a pdf, which has density 1 in 
[0, 0.5) and rises linearly from to 4 in [1, 1.5). Everywhere else it is zero. This 
pdf is realized by the following C function: 

double pdf (double x) 
{ 

if( (x<0)ll 

((x>=0.5)&Se(x<1)) I I 

(x>1.5) ) 

return(O.O) ; 
else if ((x>=0)&&(x<0.5)) 

return (1.0) ; 

else 

return (4. 0* (x-1) ) ; 

y 

The resulting empirical histogram pdf is shown in Fig. 7.9. 




Figure 7.9: Histogram pdf (see page 35) of 10^ random numbers generated using 
the rejection method for an artificial pdf. 

The rejection method can always be applied if the probability density is 
boxed, but it has the drawback that more random numbers have to be generated 
than can be used: If yl = {xi — a:;o)2/max is the area of the box, one has on average 
to generate 2A auxiliary random numbers to obtain one random number of the 
desired distribution. If this leads to a very poor efficiency, you can consider to 
use several boxes for different parts of the pdf. 

7.2.5 The Gaussian Distribution 

In case neither the distribution function can be inverted nor the probability fits 
into a box, special methods have to be applied. As an example, a method for 



26 



A.K. Hartmann: Practical Guide to Computer Simulations 



generating random numbers distributed according to a Gaussian distribution is 
considered. Other methods and examples of how different techniques can be 
combined are collected in [Morgan (1984)]. 

The probability density function for the Gaussian distribution with mean /i 
and variance is shown in Eq. (7.36), see also Fig. 7.10. It is, apart from 
uniform distributions, the most common distribution occurring in simulations. 

0.5 I ' 1 ' 1 ■ , ' 1 




Figure 7.10: Gaussian distribution with zero mean and unit width. The circles 
represent a histogram pdf (see page 35) obtained from 10"^ numbers drawn with 
the Box-Miiller method. 

Here, the case of a standard Gaussian distribution (/x = 0, a = 1) is con- 
sidered. If you want to realize the general case, you have to draw a standard 
Gaussian distributed number z and then use az + fj, which is distributed as 
desired. 

Since the Gaussian distribution extends over an infinite interval and because 
the distribution function cannot be inverted, the methods from above are not 
applicable. The simplest technique to generate random numbers distributed 
according to a Gaussian distribution makes use of the central limit theorem 
7.1.2. It tells us that any sum of K independently distributed random variables 
Ui (with mean fj, and variance v) will converge to a Gaussian distribution with 
mean Kfi and variance Kv. If again Ui is taken to be uniformly distributed in 
[0, 1) (which has mean jj, = 0.5 and variance v = 1/12), one can choose K = 12 
and the random variable Z = J2i=i Ui — Q will be distributed approximately 
according to a standard Gaussian distribution. The drawbacks of this method 
are that 12 random numbers are needed to generate one final random number 
and that numbers larger than 6 or smaller than -6 will never appear. 

In contrast to this technique the Box-Miiller method is exact. You need two 
random variables Ui, U2 uniformly distributed in [0, 1) to generate two indepen- 
dent Gaussian variables Ni,N2. This can be achieved by generating ui, U2 from 
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Ui,U2 and assigning 



m = 



-\/— 21og(l — Ui) cos{2nU2) 



n2 = 



-\/— 21og(l — Ml) sin(27rU2 ) 



A proof that ni and n2 are indeed distributed according to (7.36) can be found 
e.g. in [Press et al. (1995), Morgan (1984)], where also other methods for gen- 
erating Gaussian random numbers, some even more efficient, are explained. A 
method which is based on the simulation of particles in a box is explained in 
[Fernandez and Criado (1999)]. In Fig. 7.10 a histogram pdf of 10^ random 
numbers drawn with the Box-Miiller method is shown. Note that you can find 
an implementation of the Box-Miiller method in the solution of Exercise (3). 

7.3 Basic data analysis 

The starting point is a sample of n measured points {xo,xi, Xn-i} of 
some quantity, as obtained from a simulation. Examples are the density of a 
gas, the transition time between two conformations of a molecule, or the price 
of a stock. We assume that formally all measurements can be described by 
random variables X, representing the same random variable X and that all 
measurements are statistically independent of each other (treating statistical 
dependencies is treated in Sec. 7.5). Usually, one does not know the underlying 
probability distribution F{x), having density p{x), which describes X. 

7.3.1 Estimators 

Thus, one wants to obtain information about X by looking at the sample 

{xo,Xi, . . . , Xn-i}. In principle, one does this by considering estimators h = 
i). Since the measured points are obtained from random vari- 
ables, H = h{Xo,Xi, . . . ,Xn-i) is a random variable itself. Estimators are 
often used to estimate parameters 9 of random variables, e.g. moments of dis- 
tributions. The most fundamental estimators are: 

• The mean 




n-1 



(7.50) 



• The sample variance 




n-1 



(7.51) 



i=0 



The sample standard deviation is s = 
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As example, next a simple C function is shown, 
which calculates the mean of n data points. The 
function obtains the number n of data points and 
an array containing the data as arguments. It 
returns the average: 



GET SOURCE CODE 



DIR: randomness 
FILE(S): mean.c 



double meanCint n, double *x) 
■C 

double sum = 0.0; 
int i ; 



/* sum of values */ 
/* counter */ 



for(i=0; i<n; i++) 

sum += X [i] ; 
return (sum/n) ; 



/* loop over all data points */ 



You are asked to write a similar function for calculating the variance in exercise 
(3). 

The sample mean can be used to estimate the expectation value fx = E[X] of 
the distribution. This estimate is unbiased, which means that the expectation 
value of the mean, for any sample sizes n, is indeed the expectation value of 
the random variable. This can be shown quite easily. Note that formally the 
random variable from which the sample mean x is drawn is X = ^ J2i=o -^i- 



^_=E[X]=E 



^ n — 1 "I ^ n — 1 ^ 

-VX, = - VE[Xi] = -nE[X]=E[X]=M (7.52) 



i=0 



Here again the linearity of the expectation value was used. The fact that the 
estimator is unbiased means that if you repeat the estimation of the expectation 
value via the mean several times, on average the correct value is obtained. This 
is independent of the sample size. In general, the estimator h for a parameter 6 
is called unbiased if E[/i] = 9. 

Contrary to what you might expect due to the symmetry between Eqs. 
(7.16) and (7.51), the sample variance is not an unbiased estimator for the 
variance = Var[X] of the distribution, but is biased. The fundamental reason 
is, as mentioned above, that X is itself a random variable which is described 
by a distribution P-^. As shown in Eq. (7.52), this distribution has mean fi, 
independent of the sample size. On the other hand, the distribution has the 
variance 



£7^ = Var[X] = Var 



^ 71 — 1 

n ^ — ^ 



n 

L i=o 



i=0 



X;nYar\X\ = — 
n 



(7.53) 
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Thus, the distribution of X gets narrower with increasing sample size n. This 
has the following consequence for the expectation value of the sample variance 
which is described by the random variable S'^ = ^ Y^i^^oi-^i ~ X)'^: 



E 



^ n—1 



n 

L 1=0 



I'n-l 



E 



-y{Xf-2XiX + X^ 

1=0 

= Me ] - n E[X^] ] ^'^'^ 1 {n{a' + - n{a'^ + m^)) 

(7.54) 



2 



(7.53) 1 / 2 2 <^ 

= — [ na + nu — n na = 

n \ n ' 



n 
n-1 



This means that, although is biased, ^^s^ is an unbiased estimator for the 
variance of the underlying distribution of X. Nevertheless, also becomes 
unbiased for n — > oo.^ 

For some distributions, for instance a power-law distribution Eq. (7.41) with 
exponent 7 < 2, the variance does not exist. Numerically, when calculating 

according Eq. (7.51), one observes that it will not converge to a finite value 
when increasing the sample size n. Instead one will observe occasionally jumps 
to higher and higher values. One says the estimator is not robust. To get still 
an impression of the spread of the data points, one can instead calculate the 
average deviation 



D 



- n — 1 

-E 



(7.55) 



i=0 



In general, an estimator is the less robust, the higher the involved moments 
are. Even the sample mean may not be robust, for instance for a power-law 
distribution with 7 < 1. In this case one can use the sample median, which is the 
value Xm such that Xi < Xm for half the sample points, the (n + l)/2'th 

sample point if they are sorted in ascending order. ^ The sample median is clearly 
an estimator of the median (see Def. 7.1.2). It is more robust, because it is less 
influenced by the sample points in the tail. The simplest way to calculate the 
median is to sort all sample points in ascending order and take the sample point 
at the (n/2 + l)'th position. This process takes a running time C'(nlogn). 
Nevertheless, there is an algorithm [Press et al. (1995), Gormen et al. (2001)] 
which calculates the median even in linear running time 0{n). 



7.3.2 Confidence intervals 

In the previous section, we have studied estimators for parameters of a ran- 
dom variable X using a sample obtained from a series of independent random 

^Sometimes the sample variance is defined as 5* = ^^^^ YH=o(^i ~ maJje it an 

unbiased estimator of the variance. 

^If n is even, one can take the average between the n/2'th and the (n + l)/2'th sample 
point in ascending order. 
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experiments. This is a so-called point estimator, because just one number is 
estimated. 

Since each estimator is itself a random variable, each estimated value will be 

usually off the true value 9. Consequently, one wants to obtain an impression of 
how far off the estimate might be from the real value 9. This can be obtained 
for instance from: 

Definition The mean squared error of a point estimator H = 

h{XQ, Xi, . . . , Xn-i) for a parameter 9 is 

MSE(i?) = E[(i?-e')2] =E[(ff-E[iJ]+E[i7]-0)2] 

= E[(iJ - E[iJ])2] + E[2(iJ - E[i/])(E[7J] - 9)] + E[(E[i?] - 9f] 
= ^{H - E[F])2] + 2 (Efff] - E[ff])(E[i?] -6) + (E[iJ] - Of 

=0 

= Yav[H] + {^H]-6f (7.56) 



If an estimator is unbiased, i.e., if E[iJ] = 9, the mean squared error is 
given by the variance of the estimator. Hence, if for independent samples (each 
consisting of n sample points) the estimated values arc close to each other, the 
estimate is quite accurate. Unfortunately, usually only one sample is available 
(how to circumvent this problem rather ingeniously, see Sec. 7.3.4). Also the 
mean squared error docs not immediately provide a probabilistic interpretation 
of how far the estimate is away from the true value 9. 

Nevertheless, one can obtain an estimate of the error in a probabilistic sense. 
Here we want to calculate a so-called confidence interval also sometimes named 
error bar. 

Definition For a parameter 9 describing a random variable, two estimators 

la = la {xo , a;i , . . . , x„- 1 ) and u = (xq ,xi, . . . , a;„_ i ) which are obtained from 
a sample {xq, xi, . . . , Xn-i] provide a confidence interval if, for given confidence 
level 1 — a € (0, 1) we have 

P(/a <9 <Ua) = l-a (7.57) 

The value a G (0, 1) is called conversely significance level. This means, the true 
but unknown value 6 is contained in the interval (Z, m), which is itself a random 
variable as well, with probability \ — a. Typical values of the confidence level 
are 0.68, 0.95 and 0.99 (a = 0.32, 0.05, 0.01, respectively), providing increasing 
confidence. The more one wants to be sure that the interval really contains the 
true parameter, i.e. the smaller the value of a, the larger the confidence interval 
will be. 

Next, it is quickly outlined how one arrives at the confidence interval for the 
mean, for details please consult the specialized literature. First we recall that 
according to its definition the mean is a sum of independent random variables. 
For computer simulations, one can assume that usually (see below for a coun- 
terexample) a sufficiently large number of experiments is performed.^ Therefore, 



This is diflferent for many empirical experiments, for example, when testing new treat- 
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according to the central limit theorem 7.1.2 X should exhibit (approximately) 
a pdf /y which is Gaussian with an expectation value fi and some variance 
(7^ = cr^/n. This means, the probability a that the sample means fall outside 
an interval / = [/i — za^^, /i + zaj^] can be easily obtained from the standard 
normal distribution. This situation is shown in the Fig. 7.11. Note that the 
interval is symmetric about the mean /x and that its width is stated in multiples 
z = z{a.) of the standard deviation cr-^. The relation between significance level 
a and half interval width z is just dx f^{x) = 1 — a. Hence, the weight of 
the standard normal distribution outside the interval [—-2, 2;] is a. This relation 
can be obtained from any table of the standard Gaussian distribution or from 
the function gsl_cdf _gaussian_P() of the GNU scientific library (see Sec. 6.3). 
Usually, one considers integer values z = 1,2,3 which correspond to significance 
levels a = 0.32, 0.05, and 0.003, respectively. So far, the confidence interval / 




Figure 7.11: Probability density function of the sample mean X for large enough 

sample sizes n where the distribution becomes Gaussian. The true expectation 
value is denoted by n and a-Y is the variance of the sample mean. The probability 
that a random number drawn from this distribution falls outside the symmetric 
interval [fi — za^, fi + za-^l is a. 

contains the unknown expectation value fi and the unknown variance a-^- First, 
one can rewrite 

1 — a = P(/i — za-Y < X < fi + zaj^) 

= P{—Z(J-x ^ X — IJ' ^ z^x) 

= P{-X - zCTy < < -'^zaj^) 

= P(X - za-Y < /i < X + zcTy) . 

This now states the probability that the true value, which is estimated by the 
sample mean x, lies within an interval which is symmetric about the estimate x. 

ments in medical sciences, where often only a very restricted number of experiments can be 
performed. In this case, one has to consider special distributions, like the Student distribution. 
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Note that the width 2z(Ty is basically given by cry = y Var[X]. This explains 
why the mean squared error MSE(_ff) = Var[iJ], as presented in the beginning 
of this section, is a good measure for the statistical error made by the estimator. 
This will be used in Sec. 7.3.4. 

To finish, we estimate^ the true variance cr^ using :f^s'^, hence we get cry = 
-7= ~ -^==. To summarize we get: 




Note that this confidence interval, with 1^ = x — z(a)S/\/n — 1 and = 
X + z{a)S/\/n — 1, is symmetric about x, which is not necessarily the case for 
other confidence intervals. Very often in scientific publications, to state the 
estimate for [i including the confidence interval, one gives the range where the 
true mean is located in 68% of all cases (z = 1) i.e. x± ,^ ^ , this is called the 
standard Gaussian error bar or one a error bar. Thus, the sample variance and 
the sample size determine the error bar/ confidence interval. 

For the variance, the situation is more complicated, because it is not simply 
a sum of statistically independent sample points {a;o,a;i, . . . , a;„_i}. Without 
going into the details, here only the result from the corresponding statistics lit- 
erature [Dekking et al (2005), Lefebvre (2006)] is cited: The confidence interval 
where with probability 1 — a the true variance is located is given by [ct^^jO-^] 
where 



2 f^^^ 








2 


(7.59) 



Here, x^iPi'^) is the inverse of the cumulative chi-squared distribution with v 
degrees of freedom. It states the value where F{x^,i') = (3, see page 16. This 
chi-squared function is implemented in the GNU scientific library (see Sec. 6.3) 
in the function gsl_cdf _chisq_Pinv() . 

Note that as one alternative, you could regard yi = {xi — x) approximately 
as independent data points and use the above standard error estimate described 
for the mean of the sample {yi}. Also, one can use the bootstrap method as 
explained below (Sec. 7.3.4), which allows to calculate confidence intervals for 
arbitrary estimators. 
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7.3.3 Histograms 

Sometimes, you do not only want to estimate moments of an underlying distri- 
bution, but you want to get an impression of the full distribution. In this case 
you can use histograms. 

Definition A histogram is given by a set of disjoint intervals 

Bk = [lk,Uk), (7.60) 

which are called 6ms and a counter hk for each bin. For a given sample of 
n measured points {xo,xi, . . . , Xn-i}, bin hk contains the number of sample 
points Xi which are contained in Bk. 
Exeimple For the sample 

{xi} = {1.2, 1.5, 1.0, 0.7, 1.4, 2.0, 
1.5, 1.1, 0.9, 1.9, 1.2, 0.8} 

the bins 

[0,0.5), [0.5,1.0) [1.0,1.5), [1.5,2.0), [2.0,2.5) [2.5,3.0), 
are used, resulting in 

hi = 0, h2 = 3, hs = 5, /i4 = 3, h^ = 1, h^ = 
which is depicted in Fig. 7.12. 

6 1 \ \ \ \ I I 

5- n 

4- 

■«~3- n n 



2- 
1- 




Figure 7.12: Histogram for the data shown in Ex. 7.3.3. 

In principle, the bins can be chosen arbitrarily. You should take care that 
the union of all intervals covers all (possible or actual) sample points. Here, it is 
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assumed that the bins are properly chosen. Note also that the width bk = Uk — lk 
of each bin can be different. Nevertheless, often bins with uniform width are 
used. Furthermore, for many applications, for instance, when assigning different 
weights to different sample points^, it is useful to consider the counters as real- 
valued variables. A simple (fixed-bin width) C implementation of histograms is 
described in Sec. 3.2. The GNU scientific library (see Sec. 6.3) contains data 
structures and functions which implement histograms allowing for variable bin 
width. 

Formally, for a given random variable X, the count hk in bin k can be seen as 
a result of a random experiment for the binomial random variable Hk ^ B{n, pk) 
with parameters n and pk, where pk = P{X G B^) is the probability that a 
random experiment for X results in a value which is contained in bin B^. This 
means that confidence intervals for a histogram bin can be obtained in principle 
from a binomial distribution. Nevertheless, for each sample the true value for 
a value pk is unknown and can only be estimated by qk = hk/n. Hence, the 
true binomial distribution is unknown. On the other hand, a binomial random 
variable is a sum of n Bernoulli random variables with parameter pk- Thus, the 
estimator qk is nothing else than a sample mean for a Bernoulli random variable. 
If the number of sample points n is "large" (see below), from the central limit 
theorem 7.1.2 and as discussed in Sec. 7.3.2, the distribution of the sample mean 
(being binomial in fact) is approximately Gaussian. Therefore, one can use the 
standard confidence interval Eq. (7.58), in this case 



Here, according Eq. (7.19), the Bernoulli random variable exhibits a sample 
variance = qk{l — qk) = (hk/n){l — hk/n). Again, z = z{a) denotes the 
half width of an interval [—z,z] such that the weight of the standard normal 
distribution outside the interval equals a. Hence, the estimate with standard 
error bar {z = 1) is qk ± v^gfe(r^^gfe)7("^^^- 

The question remains: What is "large" such that you can trust this "Gaus- 
sian" confidence interval? Consider that you measure for example no point at all 
for a certain bin Bk- This can happen easily in the regions where pk is smaller 
than 1/n but non-zero, i.e. in regions of the histogram which are used to sample 
the tails of a probability density function. In this case the estimated fraction 
can easily he qk = resulting also in a zero-width confidence interval, which is 
certainly wrong. This means, the number of samples n needed to have a reliable 
confidence interval for a bin Bk depends on the number of bin entries. A rule 
of thumb from the statistics literature is that nqk{l — qk) > 9 should hold. If 
this condition is not fulfilled, the correct confidence interval [qi^i , qt^u] for qk has 
to be obtained from the binomial distribution and it is quite complicated, since 
it uses the F distribution (see Def. 7.1.2 on page 16) 

^This occurs for some advanced simulation techniques. 




(7.61) 
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hk + {n-hk + l)Fi 

{hk + l)F2 
{hk + 1)^2 + n - /ifc ' 
where J'l = F{1 - a/2; 2n - 2hk + 2, 2hk) 
F2 = F{1 - a/2; 2hk + 2,2n- 2hk) 

The value F{f3; n, r^) states the x value such that the distribution function 
for the F distribution with number of degrees ri and r2 reaches the value (3. 
This inverse distribution function is implemented in the GNU scientific library 
(see Sec. 6.3). If you always use these confidence intervals, whieli are usually 
not symmetric about qk, then you cannot go wrong. Nevertheless, for most 
applications the standard Gaussian error bars are fine. 

Finally, in case you want to use a histogram to represent a sample from a 
continuous random variable, you can easily interpret a histogram as a sample 
for a probability density function, which can be represented as a set of points 
{(.'Efc,p(.ife))}. This is called the histogram pdf or the sample pdf. For sim- 
plicity, it is assumed that the interval mid points of the intervals are used as 
a;-coordinate. For the normalization, we have to divide by the total number of 
counts, as for qk = hk/n and to divide by the bin width. This ensures that the 
integral of the sample pdf, approximated by a sum, gives just unity. Therefore, 
we get 

Xk = {lk + Uk)/2 
p(xk) = hk/inbk). (7.63) 

The confidence interval, whatever type you choose, has to be normalized in the 
same way. A function which prints a histogram as pdf, with simple Gaussian 
error bars, is shown in Sec. 3.2. 

For discrete random variables, the histogram can be used to estimate the 
pmf.^. In this case the choice of the bins, in particular the bin widths, is 
easy, since usually all possible outcomes of the random experiments are known. 
For a histogram pdf, which is used to describe approximately a continuous 
random variable, the choice of the bin width is important. Basically, you have 
to adjust the width manually, such that the sample data is respresented "best". 
Thus, the bin width should not be too small nor too large. Sometimes a non- 
uniform bin width is the best choice. In this case no general advice can be given, 
except that the bin width should be large where few data points have been 
sampled. This means that each bin should contain roughly the same number 
of sample points. Several different rules of thumb exist for uniform bin widths. 
For example b = S.^dSn''^/^ [Scott (1979)], which comes from minimizing the 
mean integrated squared difference between a Gaussian pdf and a sample drawn 

^For discrete random variables, the values are already suitably normalized 



(7.62) 
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from this Gaussian distribution. Hence, the larger the variance S of the sample, 
the larger the bin width, while increasing the number of sample points enables 
the bin width to be reduced. 

In any case, you should be aware that the histogram pdf can be only an 
approximation of the real pdf, due to the finite number of data points and due 
to the underlying discrete nature resulting from the bins. The latter problem has 
been addressed in recent years by so-called kernel estimators [Dekking et al (2005)]. 
Here, each sample point Xi is represented by a so-called kernel function. A kernel 
function k{x) is a peaked function, formally exhibiting the following properties: 

• It has a maximum at 0. 

• It falls off to zero over some some distance h. 

• Its integral J k{x) dx is normalized to one. 

Often used kernel functions are, e.g., a triangle, a cut upside-down parabola or 
a Gaussian function. Each sample point xt is represented such that a kernel 
function is shifted having the maximum at Xi. The estimator p{x) for the pdf 
is the suitably normalized sum (factor 1/n) of all these kernel functions, one for 
each sample point: 



The advantages of these kernel estimators are that they result usually in a 
smooth function p and that for a value p{x) also sample points more distant 
from X may contribute, with decreasing weight for increasing distance. The 
most important parameter is the width h, because too small a value of h will 
result in many distinguishable peaks, one for each sample point, while too large 
value a of ft, leads to a loss of important details. This is of similar importance 
as the choice of the bin width for histograms. The choice of the kernel function 
(e.g. a triangle, an upside-down parabola or a Gaussian function) seems to be 
less important. 

7.3.4 Resampling using Bootstrap 

As pointed out, an estimator for some parameter 6, given by a fimction 
h{xo,Xi, . . . ,Xn-i), is in fact a random variable H = h{Xo, Xi, . . . , Xn^i). 
Consequently, to get an impression of how much an estimate differs from the 
true value of the parameter, one needs in principle to know the distribution 
of the estimator, e.g. via the pdf pn or the distribution function Fh. In the 
previous chapter, the distribution was known for few estimators, in particular 
if the sample size n is large. For instance, the distribution of the sample mean 
converges to a Gaussian distribution, irrespectively of the distribution function 
Fx describing the sample points {xi}. 

For the case of a general estimator H, in particular if Fx is not known, 
one may not know anything about the distribution of H. In this case one can 
approximate Fx by the sample distribution function: 




(7.64) 
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Definition For a sample {xq, xi, . . . , Xn^i}, the sample distribution function 
(also called empirical distribution function) is 

^ , , number of sample points Xi smaller than or equal to x 

F^{x) = (7.65) 

Note that this distribution function describes in fact a discrete random vari- 
able (called X here), but is usually (but not always) used to approximate a 
continuous distribution function. 

The bootstrap principle is to use F-^ instead of Fx ■ The name of this principle 
was made popular by B. Efron [Efron (1979), Efron and Tibshirani (1994)] and 
comes from the fairy tale of Baron Miinchhausen, who dragged himself out of a 
swamp by pulling on the strap of his boot.® Since the distribution function Fx 
is replaced by the empirical sample distribution function, the approach is also 
called em,pirical bootstrap, for a variant called parametric bootstrap see below. 

Now, having F-^ one could in principle calculate the distribution function 
F^ for the random variable H = h{Xo, Xi,. . . , X„_i) exactly, which then is an 
approximation of Fh- Usually, this is to cumbersome and one uses a second 
approximation: One draws so-called bootstrap samples {a;o,i;i, . • . ,x„-i} from 
the random variable X. This is called resampling. This can be done quite 
simply by n times selecting {with replacement) one of the data points of the 
original sample {xi}, each one with the same probability 1/n. This means 
that some sample points from {xi} may appear several times in {xi}, some 
may not appear at all.^ Now, one can calculate the estimator value h* = 
h{xo,Xi, . . . ,Xn-i) for each bootstrap sample. This is repeated K times for 
different bootstrap samples resulting in K values {k = 1,...,K) of the 
estimator. The sample distribution fimction Fh' of this sample {hi} is the final 
result, which is an approximation of the desired distribution function Fh- Note 
that the second approximation, replacing Fj^ by Fh* can be made arbitrarily 
accurate by making K as large as desired, which is computationally cheap. 

You may ask: Does this work at all, i.e., is Fh' a good approximation 
of Fff? For the general case, there is no answer. But for some cases there 
are mathematical proofs. For example for the mean H = X the distribution 
function F^' in fact converges to F-^. Here, only the subtlety arises that one 
has to consider in fact the normalized distributions of X — ^ (a* = E[X]) and 
X — X {x = Xi/n). Thus, the random variables arc just shifted by constant 

values. For other cases, like for estimating the median or the variance, one has 
to normalize in a different way, i.e., by subtracting the (empirical) median or by 
dividing by the (empirical) variance. Nevertheless, for the characteristics of Fh 
we are interested in, in particular in the variance, see below, normalizations like 
shifting and stretching are not relevant, hence they are ignored in the following. 
Note that indeed some estimators exist, like the maximum of a distribution, for 
which one can prove conversely that Fh' does not converge to Fh, even after 



^In the European version, he dragged himself out by pulling his hair. 

^The probability for a sample point not to be selected is (1 — 1/n)" = exp(nlog(l — 1/n)) 



exp(n(— 1/n)) = exp(— 1) Ri 0.367 for n — » oo. 
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some normalization. On the other hand, for the purpose of getting a not too 
bad estimate of the error bar, for example, bootstrapping is a very convenient 
and suitable approach which has received high acceptance during recent years. 

Now one can use Fh* to calculate any desired quantity. Most important is 
the case of a confidence interval [hi , hu] such that the total probability outside 
the interval is a, for given significance level a, i.e. Fh» {hu) — FH* {hi) = 1 — a. In 
particular, one can distribute the weight a equally below and above the interval, 
which allows to determine hi , hu 

FH^{hu)^FH'{hi) = a/2. (7.66) 

Similar to the confidence intervals presented in Sec. 7.3.2, [hi, hu] also represents 
a confidence interval for the unknown parameter 6 which is to be estimated from 
the estimator (if it is unbiased). Note that [hi,hy\ can be non-symmetric about 
the actual estimate h{xo,xi, . . . , Xn-i)- This will happen if the distribution Fh* 
is skewed. 

For simplicity, as wc have seen in Sec. 7.3.2, one can use the variance Var[iJ] 
to describe the statistical uncertainty of the estimator. As mentioned on page 
32, this corresponds basically to a a = 0.32 uncertainty. 

The following C fimction calculates Ya.r[H*], 
as approximation of the unknown Var[i?]. One 
has to pass as arguments the number n of sample 
points, an array containing the sample points, the 
number K of bootstrap iterations, and a pointer 



GET SOURCE CODE 



DIR: randomness 
FILE(S): bootstrap. c 
bootstrap-test . c 



to the function f which represents the estimator, f has to take two arguments: 

the number of sample points and an array containing a sampk\ For an expla- 
nation of function pointers, see Sec. 1.4. The function bootstrap_variance () 
returns Yar[H*]. 

double bootstrap_variance(int n, double *x, int n_resainple, 

double (*f)(int, double *)) 

{ 

double *xb; /* bootstrap sample */ 

double *h; /* results from resampling */ 

int sample, i; /♦ loop counters ♦/ 

int k; /* sample point id */ 

double var; /* result to be returned */ 

h = (double *) malloc (n.resample * sizeof (double)) ; 
xb = (double *) malloc(n * sizeof (double)) ; 
f or(sajnple=0; sample<n_resample ; sample++) 
-C 

for(i=0; i<n; i++) /* resample */ 

i 

k = (int) f loor(drand48()*n) ; /* select random point */ 

xb [i] = X [k] ; 

} 

hCsample] = f(n, xb) ; /* calculate estimator */ 

} 
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var = variance (n_resample , h.) ; /* obtain bootstrap variance */ 

free (h) ; 
free (xb) ; 
retiirn(var) ; 

} 

The bootstrap samples {xi} are stored in the array xb, while the sampled es- 
timator values {h^.} arc stored in the array h. These arrays are allocated in lines 
10-11. In the main loop (lines 12-20) the bootstrap samples are calculated, each 
time the estimator is obtained and the result is stored in h. Finally, the vari- 
ance of the sample {hi} is calculated (line 22). Here, the function varianceO 
is used, which works similarly to the function meanO, see exercise (3). Your 
are asked to implement a bootstrap function for general confidence interval in 
exercise (4). 

The most obvious way is to call bootstrap_variaiice() with the estimator 
mean as forth argument. For a distribution which is "well behaved" (i.e., where 
a sum of few random variables resembles the Gaussian distribution), you will 
get a variance that is, at least if n_resaiiiple is reasonably large, very close to 
the standard Gaussian (a = 0.32) error bar. 

For calculating properties of the sample mean, the bootstrap approach works 
fine, but in this case one could also be satisfied with the standard Gaussian con- 
fidence interval. The bootstrap approach is more interesting for non-standard 
estimators. One prominent example from the field of statistical physics is the 
so-called Binder cumulant [Binder (1981)], which is given by: 

b{xo,xi,...,Xn-i) = 0.5 ^3-^^ (7.67) 

where TTT is again the sample mean, for exam- 
ple x-^ = Y^^i=o "^i ■ '^^^ Binder cumulant is often 
used to determine phase transitions via simula- 
tions, where only systems consisting of a finite 
number of particles can be studied. For exam- 
ple, consider a ferromagnetic system held at some 
temperature T. At low temperature, below the 
Curie temperature T^, the system will exhibit a macroscopic magnetization m. 
On the other hand, for temperatures above T^, m will on average converge to 
zero when increasing the system size. This transition is fuzzy, if the system 
sizes are small. Nevertheless, when evaluating the Binder cumulant for different 
sets of sample points {m(T. L)i} which are obtained at several temperatures T 
and for different system sizes L, the bL{T) curves for different L will all cross 
[Landau and Binder (2000)] (almost) at Tc, which allows for a very precise de- 
termination of Tc. A sample result for a two-dimensional (i.e. layered) model 
ferromagnet exhibiting L x L particles is shown in Fig. 7.13. The Binder cumu- 
lant has been useful for the simulation of many other systems like disordered 
materials, gases, optimization problems, liquids, and graphs describing social 
systems. 



GET SOURCE CODE 

DIR: randomness 
FILE(S): 
binder_L8 . dat 
binder _L10 . dat 
binder _L16 . dat 
binder_L30 . dat 
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T 

Figure 7.13: Plot of Binder cumulant of two-dimensional model ferromagnet as 
function of temperature T (dimensionless units). Each system consists of L x L 
particles. The curves for different system sizes L cross very close to the phase 
transition temperature Tc — 2.269 (known from analytical calculations of this 
model). The error bars shown can be obtained using a bootstrap approach. 



A confidence interval for the Binder cumulant is very difficult (or even 
impossible) to obtain using standard error analysis. Using bootstrapping, it 
is straightforward. You can use simply the function bootstrap_variance () 
shown above while providing as argument a function which evaluates the Binder 
cumulant for a given set of data points. 

So far, it was assumed that the empirical distribution function F-^ was used 
to determine an approximation of Fh- Alternatively, one can use some addi- 
tional knwoledge which might be available. Or one can make additional assump- 
tions, via using a distribution function Fx which is parametrized by a vector 
of parameters A. For an exponential distribution, the vector would just con- 
sist of one parameter, the expectation value, while for a Gaussian distribution, 
A would consist of the expectation value and the variance. In principle, arbi- 
trary complex distributions with many parameters are possible. To make -Fa a 
"good" approximation of Fx, one has to adjust the parameters such that the 
distribution function represents the sample {xi} "best", resulting in a vector 
A of parameters. Methods and tools to perform this fitting of parameters are 
presented in Sec. 7.6.2. Using F^ one can proceed as above: Either one cal- 
culates Fjj exactly based on F^, which is most of the time too cumbersome. 
Instead, usually one performs simulations where one takes repeatedly samples 
{xo,xi, . . . ,i„_i} from simulating Fj^ and calculates each time the estimator 
h* = h{xQ,Xi, . . . ,Xn~i). This results, as in the case of the empirical boot- 
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strap discussed above, in a sample distribution function Fh* which is further 
analyzed. This approach, where Fx is used instead of F-^ , is called parametric 
bootstrap. 

Note that the bootstrap approach does not require that the sam- 
ple points are statistically independent of each other. For in- 
stance, the sample could be generated using a Markov chain Monte 
Carlo simulation [Newman and Barkema (1999), Landau and Binder (2000), 
Robert and Casella (2004), Liu (2008)], where each data point ajj+i is calcu- 
lated using some random process, but also depends on the previous data point 
Xi. More details on how to quantify correlations are given in Sec. 7.5. Never- 
theless, if the sample follows the distribution Fx, everything is fine when using 
bootstrapping and for example a confidence interval will not depend on the frac- 
tion of "independent" data points. One can see this easily by assuming that you 
replace each data point in the original sample {xi} by ten copies, hence making 
the sample ten times larger without adding any information. This will not affect 
any of the following bootstrap calculations, since the size of the sample does not 
enter explicitly. The fact that bootstrapping is not susceptible to correlations 
between data points is in contrast to the classical calculation of confidence in- 
tervals explained in Sec. 7.3.2, where independence of data is assumed and the 
number of independent data points enters formulas like Eq. (7.58). Hence, when 
calculating the error bar according to Eq. (7.58) using the ten-copy sample, it 
will be incorrectly smaller by a factor a/TO, since no additional information is 
available compared to the original sample. 

It should be mentioned that bootstrapping is only one of several resampling 
techniques. Another well known approach is the jackknife approach, where 
one does not sample randomly using F-^ or a fitted F\. Instead the sample 
{xi} is divided into B blocks of equal size ni, = n/B (assuming that n is a 
multiple of B). Note that choosing B = n is possible and not uncommon. 
Next, a number B of so-called jackknife samples b = 1, . . . , B are formed from 
the original sample {xi} by omitting exactly the sample points from the 6'th 
block and including all other points of the original sample. Therefore, each of 
these jackknife samples consists of n — nf, sample points. For each jackknife 
sample, again the estimator is calculated, resulting in a sample {h'j^^} of size 
B. Note that the sample distribution function F^^^ of this sample is not an 
approximation of the estimator distribution function F^l Nevertheless, it is 
useful. For instance, the variance Var[iJ] can be estimated from {B—1)S^, where 

5^ is the sample variance of {^1^^}. No proof of this is presented here. It is just 
noted that when increasing the number B of blocks, i.e., making the different 
jackknife samples more alike, because fewer points are excluded, the sample of 
estimators values {h'"/^} will fluctuate less. Consequently, this dependence on 
the number of blocks is exactly compensated via the factor {B — 1). Note that 
for the jackknife method, in contrast to the boostrap approach, the statistical 
independence of the original sample is required. If there are correlations between 
the data points, the jackknife approach can be combined with the so-called 
blocking method [Flyvbjerg (1998)]. More details on the jackknife approach 
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can be found in [Efron and Tibshirani (1994)]. 

Finally, you should be aware that there are cases where resampling ap- 
proaches clearly fail. The most obvious example is the calculation of confidence 
intervals for histograms, see Sec. 7.3.3. A bin which exhibits no sample points, 
for example, where the probability is very small, will never get a sample point 
during resampling either. Hence, the error bar will be of zero width. This is 
in contrast to the confidence interval shown in Eq. 7.62, where also bins with 
zero entries exhibit a finite-size confidence interval. Consequently, you have to 
think carefully before deciding which approach you will use to determine the 
reliability of your results. 

7.4 Data plotting 

So far, you have learned many methods for analyzing data. Since you do not 
just want to look at tables filled with numbers, you should visualize the data 
in viewgraphs. Those viewgraphs which contain the essential results of your 
work can be used in presentations or publications. To analyze and plot data, 
several commercial and non-commercial programs are available. Here, two free 
programs are discussed, gnuplot, and xmgrace. Gnuplot is small, fast, allows 
two- and three-dimensional curves to be generated and transformed, as well as 
arbitrary functions to be fitted to the data (see Sec. 7.6.2). On the other hand 
xmgrace is more flexible and produces better output. It is recommended to use 
gnuplot for viewing and fitting data online, while xm,grace is to be preferred for 
producing figures to be shown in presentations or publications. 

7.4.1 gnuplot 

The program gnuplot is invoked by entering gnuplot in a shell, or from a menu 
of the graphical user interface of your operating system. For a complete manual 
see [Texinfo]. 

As always, our examples refer to a UNIX window system like XI 1, but the 
program is available for almost all operating systems. After startup, in the 
window of your shell or the window which pops up for gnuplot the prompt 
(e.g. gnuplot>) appears and the user can enter commands in textual form, 
results are shown in additional windows or are written into files. For a general 
introduction you can type just help. 

Before giving an example, it should be pointed out that gnuplot scripts can 
be generated by simply writing the commands into a file, e.g. command . gp, and 
calling gnuplot command. gp. 

The typical case is that you have available a 
data file of a; — y data or with x — y — dy data 
(where dy is the error bar of the y data points). 
Your file might look like this, where the "energy" 
Co of a system^ is stored as a function of the "system size" L. The filename 



GET SOURCE CODE 

DIR: randomness 
FILE(S): sg_eO_L.dat 



It is the ground-state energy of a three-dimensional ±J spin glass , a protypical system 
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is sg_eO_L.dat. The first column contains the L values, the second the energy 
values and the third the standard error of the energy. Please note that lines 
starting with "#" are comment lines which are ignored on reading: 

# ground state energy of +-J spin glasses 

# L e_0 error 

3 -1.6710 0.0037 

4 -1.7341 0.0019 

5 -1.7603 0.0008 

6 -1.7726 0.0009 
8 -1.7809 0.0008 

10 -1.7823 0.0015 
12 -1.7852 0.0004 
14 -1.7866 0.0007 

To plot the data enter 
gnuplot> plot "sg_eO_L.dat" with yerrorbars 

which can be abbreviated as p "sg_eO_L.dat" w e. Please do not forget the 
quotation marks around the file name. Next, a window pops up, showing the 
result, see Fig. 7.14. 
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Figure 7.14: Gnuplot window showing the result of a plot command. 

For the plot command many options and styles are available, e.g. with 
lines produces lines instead of symbols. It is not explained here how to set 

in statistical physics. These spin glasses model the magnetic behavior of alloys like iron-gold. 
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line styles or symbol sizes and colors, because this is usually not necessary for a 
quick look at the data. For "nice" plots used for presentations, we recommend 
xmgrace, see next section. Anyway, help plot will tell you all you have to know 
about the plot command. 

Among the important options of the plot command is that one can specify 
ranges. This can be done by specifying the range directly after the command, 
e.g. 

gnuplot> plot [7:20] "sg_eO_L.dat" with yerrorbairs 

will only show the data for x G [7,20]. Also an additional x range can be 
specified like in 

plot [7 : 20] [- 1 . 79 : - 1 . 77] " sg_eO_L . dat " with yerrorbar s 

If you just want to set the y range, you have to specify [ ] for the x-range. 
You can also fix the ranges via the set xrange and the set yrange commands, 
such that you do not have to give them each time with the plot command, see 
help set xrange or help unset xrange for unsetting a range. 

Gnuplot knows a lot of built-in functions like sin(x), log(a;), powers, roots, 
Bessel functions, error function,^ and many more. For a complete list type 
help functions. These function can be also plotted. Furthermore, using these 
functions and standard arithmetic expressions, you can also define your own 
functions, e.g. you can define a function ft (x) for the Fischer-Tippett pdf (see 
Eq. (7.43)) for parameter A (called lambda here) and show the function via 

gnuplot > ft (x)=lambda*exp(-lainbda*x) *exp(-exp(-lambda*x) ) 
gnuplot> lambda=1.0 
grLuplot> plot ft(x) 

You can also include arithmetic expressions in the plot command. To plot a 
shifted and scaled Fischer-Tippett pdf you can type: 

giiuplot> plot [0:20] 0.5*ft(0.5*(x-5)) 

The Fischer-Tippett pdf has a tail which drops off exponentially. This can 
be better seen by a logarithmic scaling of the y axis. 

gnuplot> set logscale y 

gnuplot> plot [0:20] 0.5*ft(0.5*(x-5)) 

will produce the plot shown in Fig. 7.15. 

Furthermore, it is also possible to plot several functions in one plot, via sepa- 
rating them via commas, e.g. to compare a Fischer-Tippett pdf to the standard 
Gaussian pdf, here the predefined constant pi is used: 

gnuplot> plot ft(x), exp(-x*x/2)/sqrt(2*pi) 
^The error function is erf(a;) = Jq dx' exp{—x''^). 




Figure 7.15: Gnuplot window showing the result of plotting a shifted and 
rescaled Fischer-Tippett pdf with logarithmically scaled y-axis. 



It is possible to read files with multi columns via the using data modifier^ 

e.g. 

gnuplot> plot "test.dat" using 1:4:5 w e 

displays the fourth column as a function of the first, with error bars given by 
the 5th column. The elements behind the using are called entries. Within 
the using data modifier you can also perform transformations and calculations. 
Each entry, where some calculations should be performed have to be embraced 
in ( ) brackets. Inside the brackets you can refer to the different columns of 
the input file via $1 for the first column, $2 for the second, etc. You can generate 
arbitrary expressions inside the brackets, i.e. use data from different columns 
(also combine several columns in one entry), operators, variables, predefined and 
self-defined functions and so on. For example, in Sec. 7.6.2, you will see that 
the data from the sg_eO_L . dat file follows approximately a power law behavior 
eo(i) = Coo + aL^ with too ~ —1.788, a « 2.54 and h « —2.8. To visualize this, 
we want to show eo{L) — eoo as a function of L^. This is accomplished via: 

gniiplot> einf =-1 . 788 
gnuplot> b=-2.8 

gnuplot> plot "sg_eO_L.dat" u ($l**b) : ($2-einf ) 

Now the gnuplot window will show the data as a straight line (not shown, but 
see Fig. 7.23). 
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So far, all output lias appeared on the screen. It is possible to redirect the 
output, for example, to an encapsulated postscript file (by setting set terminal 
postscript and redirecting the output set output "test.eps"). When you 
now enter a plot command, the corresponding postscript file will be generated. 

Note that not only several functions but also several data files or a mixture 
of both can be combined into one figure. To remember what a plot exported to 
files means, you can set axis labels of the figure by typing e.g. set xlabel "L", 
which becomes active when the next plot command is executed. Also you can 
use set title or place arbitrary labels via set label. Use the help command 
to find out more. 

Also three-dimensional plotting (in fact a projection into two dimensions) is 
possible using the splot command (enter help splot to obtain more informa- 
tion). Here, as example, we plot a two-dimensional Gaussian distribution: 

gnuplot> x0=3.0 
gnuplot> yO=-1.0 
gnuplot> sx=1.0 

gnuplot> sy=5.0 

gnuplot> gauss2d (x , y ) =exp (- (x-xO) **2/ (2*sx) - (y-yO) **2/ (2*sy) ) \ 
> /sqrt(4*pi**2*sx**2*sy**2) 
gnuplot> set xlabel "x" 
gnuplot> set ylabel "y" 

gnuplot> splot [x0-2:x0+2] [yO-4:yO+4] gauss2<i(x,y) with points 
gnuplot> 

Note that the long line containing the definition of the (two-argument) function 
gauss2d() is split up into two lines using a backslash at the end of the first line. 
Furthermore, some of the variables are used inside the interval specifications at 
the beginning of the splot command. Clearly, you also can plot data files with 
three-dimensional data. The resulting plot appearing in the output window is 
shown in Fig. 7.16. You can drag the mouse inside the window showing the 
plot, which will alter the view. 

Finally, to stop the execution of gnuplot, enter the command exit. These 
examples should give you already a good impression of what can be done with 
gnuplot. More can be found in the documentation or the online help. How to fit 
functions to data using gnuplot is explained in Sec. 7.6.2. It is also possible to 
make, with some effort, publication-suitable plots, but it is simpler to achieve 
this with xmgrace, which is presented in the following section. 

7.4.2 xmgrace 

The xmgrace (X Motiv GRaphing, Advanced Computation and Exploration of 
data) program is much more powerful than gnuplot and produces nicer output, 
commands are issued by clicking on menus and buttons and it offers WYSI- 
WYG. The xmgrace program offers almost every feature you can imagine for 
two-dimensional data plots, including multiple plots (insets), fits, fast Fourier 
transform, interpolation. The look of the plots may be altered in any kind of 
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ul8u: 57.0000, 4G.OO0O scale: 1.00000, 1.00000 



Figure 7.16: Gnuplot window showing the result of plotting a two-dimensional 
function using splot. 

way you can imagine like choosing fonts, sizes, colors, symbols, styles for lines, 
bar charts etc. Also, you can create manifold types of labels / legends and 
it is possible to add elements like texts, labels, lines or other geometrical ob- 
jects in the plot. The plots can be exported to various format, in particular 
encapsulated postscript ( . eps) Advanced users also can program it or use it for 
real-time visualization of simulations. On the other hand, its handling is a little 
bit slower compared to gnuplot and the program has the tendency to fill your 
screen with windows. For full information, please consult the online help, the 
manual or the program's web page [xmgrace]. 

Here, just the main steps to produce a simple but nice plot are shown and 
some further directions are mentioned. You will be given here the most impor- 
tant steps to create a similar plot to the first example, shown for the gnuplot 
program, but ready for publication. First you have to start the program by 
typing xmgrace into a shell (or to start it from some window/operating system 
menu). Then you choose the ^ata menu^°, next the import sub menu and finally 
the ASCII., sub sub menu. Then a "Grace:Read Set" window will pop up (see 
Fig. 7.17) and you can choose the data file to be loaded (here sg_eO_L . dat) [A], 
the type of the input file (Single Set) [B], the format of the data (XYDY) [C]. 
This means you have three columns, and the third one is an error bar for the 
second. Then you can hit on the OK button [E]. The data will be loaded and 



'^The underlined character appears also in the menu name and refers to the key one has to 
hit together with Alt button, if one wants to open the menu via key strokes. 
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Figure 7.17: The Grace:Read Set window of the xmgrace program. Among 
others, you can select a file [A], choose the type of the input file [B], choose the 
format of the data [C], what axes should be rescaled automatically on input [D]. 
You can actually load the data by hitting on the OK button [E] and closing the 
window by hitting on the Cancel button [F]. 



shown in the main window (see Fig. 7.18). The axis ranges have been adjusted 
to the data, because the "Autoscale on read" is set by default to "XY" [D]. You 
can quickly change the part of the data shown by the buttons (magnifier, AS, 
Z, z, j, \) on the left of the main window just below the Draw button. 

Note that another important input file type is "Block data" where the files 
consist of many columns of which you only want to show some. When you hit 
th OK button [E], another window (Grace:Edit block data) will pop up, where 
you have to select the columns which you actually want to display. For the data 
format (also when loading block data), some other important choices are XY 
(no error bars) and XYDYDY (full confidence interval, maybe non-symmetric). 
Finally, you can close the file selection window, by hitting on the Cancel button 
[F]. The OK and Cancel buttons are common to all xmgrace windows and will 
not be mentioned explicitly in the following. 

In the form the loaded data is shown by default, it is not suitable for publi- 
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cation purposes, because the symbols, lines and fonts are usually too small/ too 
thin. To adjust many details of your graph, you should go to the Plot menu. 
First, you choose the Plot appearance... sub menu. A corresponding window 
will pop up. Here, you should just unselect the "Fill" toggle box (upper right 
corner), because otherwise the bounding box included in the .eps file will not 
match the plot and your figure will overwrite other parts of e.g. your manuscript. 
The fact that your plot has no background now becomes visible through the ap- 
pearance of some small dots in the main xmgrace window, but this does not 
disrupt the output when exporting to .eps. 

Next, you choose the Set appearance... sub menu from the Plot menu. The 
corresponding window will pop up, see Fig. 7.19. You can pop this window also 
by double-clicking inside the graph. This window allows to change the actual 
display style of the data. You have to select the data set or sets [A] to which 
the changes will be be applied to when hitting the Apply button at the lower 
left of the window. Note that the list of sets in this box will contain several 
sets if you have imported more than one data set. Each of them can have (and 
usually should) its own style. The box where the list of sets appears is also used 
to administrate the sets. If you hit the right mouse button, while the mouse 
pointer is inside this box, a menu will pop up, where you can for instance copy 
or delete sets, hide or unhide them, or rearrange them. 
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Figure 7.19: The Grace:Set Appearance window of the xmgrace program. First 
you have to select the set or sets which should be addressed by the changes 
[A]. Due to the large amount of adjustable parameters, the window is organized 
into different tabs. The most import one is "Main" [B], which is shown here. 
Among others, you can select a symbol type [C] (below: symbol size, symbol 
color), choose the width of the lines [D] (also: line type, style) and the color [E]. 
Furthermore, the label for this data appearing in the legends can be states [F] . 



The options in this window are arranged within different tabs, the most 
important is the "Main" tab [B]. Here you can choose whether you want to 
show symbols for your data points and which type [C], also the symbol sizes 
and colors. If you want to show lines as well (Line properties area at the right), 
you can choose the style like "straight" and others, but also "none" is no lines 
should be displayed. The style can be full, dotted, dashed, and different dotted- 
dashed styles. For presentations and publications it is important that lines are 
well visible, in this example a line width of 2 is chosen [D] and a black color 
[E]. For presentations you can distinguish different data sets also by different 
colors, but for publications in scientific journals you should keep in mind that 
the figures are usually printed in black and white, hence light colors are not 
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visible. ^-'^ 

Each data set can have a legend (see below how to activate it). Here, the 
legend string can be stated. You can enter it directly, with the help of some 
formatting commands which are characters preceded by a backslash \ . The 

most important ones are 

• \\ prints a backslash. 

• \0 selects the Roman font, which is also the default font. A font is active 
until a new one is chosen. 

• \1 selects the italic font, used in equations. 

• \x selects a symbol font, which contains e.g. Greek characters. For ex- 
ample \xabchqL will generate Q!/3x??^A, just to mention some important 
symbols. 

• \s generates a subscript, while \N switches back to normal. For example 
\xb\s2\N\l (x) generates /?2(a;). 

• \S generates a superscript, for instance \lA\S3x\N-5 generates A^^ — 5. 

• The font size can be changed with \+ and \-. 

• With \o and \0 one can start and stop overlining, respectively, for instance 
\lA\oBC\OD generates ABCD. Underlining can be controlled via \u and 
\U. 

By default, error bars are shown (toggle box lower right corner). At least 
you should increase the line width for the symbols (Symbols tab) and increase 
the base and rise line widths for error bars (Error bars tab). 

You should know that, when you are creating another plot, you do not have 
to redo all these and other adjustments of styles. Once you have found your 
standard, you can save it using the Save Parameters... sub menu from the Plot 
menu. You can conversely load a parameter set via the Load Parameters... sub 
menu of the same menu. 

Next, you can adjust the properties of the axes, by choosing the Set appear- 
ance... sub menu from the Plot menu or by double-clicking on an axis. The 
corresponding window will pop up, see Fig. 7.20. You have to select the axis 
where the current changes apply to [A]. For the x axis you should set the range 
in the fields Start [B] and Stop [C], here to the values 1 and 15. Below these 
two fields you find the important Scale field, where you can choose linear scaling 
(default), logarithmic or reciprocal, to mention the important ones. 

The most important adjustments you can perform within the Main tab [D]. 
Here you enter the label shown below the axis in the Label string field [E]. The 
format of the string is the same as for the data set legends. Here you enter just 

Acting as referee reading scientific papers submitted to journals, I experienced many times 
that I could not recognize or distinguish some data because they were obviously printed in a 
light color, or with a thin line width, or with tiny symbols .... 
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Figure 7.20: The Grace: Axes window of the xmgrace program. First you have 
to select the axis which should be addressed by the changes [A]. Among others, 
you can change the range in the Start [B] and Stop [C] fields. Here the Main 
tab [D] is shown. You can enter an axis label in the Label string field [E] and 
select the spacing of the major and minor ticks [F,G] 



\1L, which will show as L. The major spacing of the major (with labels) and 
minor ticks can be chosen in the corresponding fields [F,G]. Below there is a 
Format field, where you can choose how the tick labels are printed. Among the 
many formats, the most common are General (1, 5, 10, . . .), Exponential (1 . Oe+00, 
5.0e+00, 1 . Oe+01,. . . ), and Power, which is useful for logarithmic scaled axes 
(10^, 10^, 10'^, . . .). For the tick labels, you can also choose a Precision. This and 
other fields of this tab you can leave at their standard values here. Nevertheless, 
you should also adjust the Char size of the axis labels (tab Axis label &l bar) and 
of the tick labels (tab Tick labels). For publications, character sizes above 150% 
are usually well readable. Note that in the Axis label &l bar tab, there is a 
field Axis transform where you can enter formulas to transform the axis more 
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or less arbitrarily, sec the manual for details. All tabs have many other fields, 
which are useful as well, but here we stay with the standard choices. Note that 
sometimes the Special tab is useful, where you can enter all major and minor 
ticks individually. 

To finish the design of the axes, you can perform similar changes to the y 
axis, with Start field -1.8, Stop field -1.6, Label string field \lE\sO\N(L) and 
the same character sizes as for the x axis for axis labels and tick labels in the 
corresponding tabs. Note that the axis label will be printed vertically. If you do 
not like this, you can choose the Perpendicular to axis orientation in the Layout 
field of the Axis label &l bar tab. 

Now you have already a nice graph. To show you some more of the capa- 
bilities of xmgrace, we refine it a bit. Next, you generate an inset, i.e. a small 
subgraph inside the main graph. This is quite common in scientific publica- 
tions. For this purpose, you select the underlineEdit menu and there the Arrange 
graph... sub menu. The corresponding window appears. We want to have just 
one inset, i.e. in total 2 graphs. For this purpose, you select in the Matrix region 
of the window the Cols: field to 1 and the Rows: field to 2. Then you hit on the 
Accept button which applies the changes and closes the window. You now have 
two graphs, one containing the already loaded data, the other one being empty. 
These two graphs are currently shown next to each other, one at the top and 
one at the bottom. 

To make the second graph an inset of the first, you choose the Graph appear- 
ance... sub menu from the ^lot menu. At the top a list of the available graphs is 
shown [A]. Here you select the first graph GO. You need only the Main tab [B], 
other tabs are for changing styles of titles, frames and legends. We recommend 
to choose Width 2 in the Frame tab. In the Main tab, you can choose the Type of 
graph [C], e.g. XY graph, which we use here (default). Polar graph or Pie chart. 
You only have to change the Viewport coordinates [D] here. These coordinates 
are relative coordinates, i.e. the standard full viewport including axes, labels 
and titles is [0, 1] x [0, 1]. For the main graph GO, you choose Xmin and Ymin 
0.15 and Xmax and Ymax 0.85. Note that below there is a toggle box Display 
legend [E], where you can control whether a legend is displayed. If you want to 
have a legend, you can control its position in the Leg. box tab. Now the different 
graphs overlap. This docs not bother you, because next you select graph Gl in 
the list at the top of the window. We want to have the inset in the free area of 
the plot, in the upper right region. Thus, you enter the viewport coordinates 
Xmin 0.38, Ymin 0.5, Xmax 0.8 and Ymax 0.8. 

Now the second graph is well placed, but empty. We want to show a scaled 
version of the data in the inset. Hence, you import the data again in the same 
way as explained above, while choosing Read to graph Gl in the Grace: Read 
sets window. In Sec. 7.6.2, you will see that the data follows approximately 
a power law behavior eo{L) = e^o + clL^ with Coo ~ —1.788, a ~ 2.54 and 
b « —2.8. To visualize this, we want to show eo(I/) — Coo as a function of 
V'. Hence, we want to transform the data. You choose from the Data menu the 
Transformations sub menu and there the Evaluate expression sub sub menu. Note 
that here you can also find many other transformations, e.g. Fourier transform, 



54 



A.K. Hartmann: Practical Guide to Computer Simulations 




Figure 7.21: The Grace; Graph Appearance window of the xmgrace program. At 
the top one can select to which graph changes should apply [A]. The window 
is divided into different tabs [B], here the Main tab is shown. The Type of the 
graph can be selected [C], also Title and Subtitle (empty here). The extensions 
of the graph can be selected in the Viewport area [D]. This allows to make one 
graph an inset of another. Using the Display legend toggle [E] the legend can be 
switched on and off. 



interpolation and curve fitting. Please consult the manual for details. In this 
case, the evaluateExpression window pops up, see Fig. 7.22 (if you did not close 
the windows you have used before, your screen will be already pretty populated). 
A transformation always takes the data points from one source set, applies a 
formula to all data points (or to a subset o points) and stores the result in a 
destination set. These sets can be selected at the top of the window in the 
Source [A] and Destination [B] fields for graph and set separately. Note that the 
data in the destination set is overwritten. If you want to write the transformed 
data to a new set, you can first copy an existing set (click on the right mouse 
button in the Destination Set window and choose Duplicate). In our case, we 
want to replace the data, hence you select for source and destination the data 
set from graph Gl. The transformation is entered below [C], here you first enter 
y=y+l .788 to shift the data. The you hit the Apply button at the bottom. Next 
you change the transformation to x=x~(-2.8) and hit the Apply button again. 
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— Source data filtering 

Restriction: None _i J Negated 



Apply I Accept I Close | 



Figure 7.22: The evaluateExpression window of the xmgrace program. At the 
top you can select Source [A] and Destination [B] sets of the transformation. The 
actual transformation is entered at the bottom [C]. 



When you now select the second graph by clicking into it, and hit the AS (auto 
scale) button on the left of the main window, you will see that the data points 
follow a nice straight line in the inset, which confirms the behavior of the data. 

Again you should select symbols, line stiles, and axis labels for the inset. 
Usually smaller font sizes are used here. Note that all operations always apply to 
the current graph, which can be selected for example by clicking near the corners 
of the boundary boxes of the graph (which does not always work, depending on 
which other windows are open) or by double clicking on the corresponding graph 
in the graph list in the Grace: Graph Appearance window. The final main window 
is shown in Fig. 7.23. Note that the left axis label is not fully visible. This is 
no problem when exporting the file as encapsulated postscript; everything will 
be shown. But if you do not like it, you can adjust the Xmin value of graph GO. 

Finally, if you choose the menu Window and the sub menu Drawing objects 
a window will pop up which enable many graphical elements like texts, lines, 
boxes and ellipses (again with a variety of choices for colors, styles, sizes etc.) 
tobe added/changed and deleted in your plot. We do not go into details here. 
Now you should save your plot using the File 
menu and the Save as... sub menu, e.g. with file 
name sg_eO_L.agr, where .agr is the typical 
postfix of xmgrace source files. When 



GET SOURCE CODE 

DIR: randomness 
FILE(S): sg_eO_L.dat 
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Wnm, :0.D, /home/3lej^e>d:e/book4/pic_r3ndam/eO_L.3gr 



Figure 7.23: The main xmgrace window after all adjustments have been made. 



you want to create another plot with similar layout later, it is convenient to 
start from this saved file by copying it to a new file and subsequently using 
again xmgrace to modify the new file. 

To export your file as encapsulated postscript, suitable for including it into 
presentations or publications (see Sec. 8.3), you have to choose the File menu 
and the Print setup... sub menu. In the window, which pops up, you can select 
the Device EPS. The file name will automatically switch to sg_eO_L.eps (this 
seems not to work always, in particular if you edit several files, one after the 
other, please check the file names always). Having hit on the Accept button, 
you can select the File menu and the ^rint sub menu, which will generate the 
desired output file.^^ 

Now you have a solid base for viewing and plotting, hence we can continue 
with advanced analysis techniques. You can experiment with plotting using 
xmgrace in exercise (5). 



^ Using the tool epstopdf you can convert the postcript file also to a pd/file. With other 
tools like convert or gimp you can convert to many other styles. 
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7.5 Hypothesis testing and (in-)dependence of 
data 

In the previous section, you have learned how to visuahzed data, mainly data 
resulting from the basic analysis methods presented in Sec. 7.3. In this sec- 
tion, we proceed with more elaborate analysis methods. One important way to 
analyze data of simulations is to test hypotheses concerning the results. The 
hypothesis to be tested is usually called null hypothesis Hq. Examples for null 
hypotheses are: 

(A) In a traffic system, opening a new track will decrease the mean value of the 
travel time tA^B for a connection A— >B below a target threshold ttargot ■ 

(B) Within an acquaintance network, a change of the rules describing how 
people meet will change the distribution of the number of people each 
person knows. 

(C) The distribution of ground-states energies in disordered magnets follows 
a Fisher-Tippett distribution. 

(D) Within a model of an ecological system, the population size of foxes is 
dependent on the population size of beetles. 

(E) For a protein dissolved in water at room temperature, adding a certain 
salt to the water changes the structure of the protein. 

One now can model these situations and use simulations to address the above 
questions. The aim is to find methods which tell us whetheror not, depending 
on the results of the simulations, we should accept a null hypothesis. There 
is no general approach. The way we can test Ho depends on the formulation 
of the null hypothesis. In any case, our result will again be based on a set of 
measurements, such as a sample of independent data points {xq, xi, . . . , x„-i}, 
formally obtained by sampling from random variables {Xq, Xi, . . . , X„_i} (here 
again, all described by the same distribution function Fx)- To get a solid sta- 
tistical interpretation, we use a test statistics, which is a fimction of the sample 
t = t{xo, xi, . . . , a;„_i). Its distribution describes a corresponding random vari- 
able T. This means, you can use any estimator (see page 27), which is also 
a function of the sample, as test statistics. Nevertheless, there are many test 
statistics, which usually are not used as estimators. 

To get an idea of what a test statistics t may look like, we discuss now test 
statistics for the above list of examples. For (A), one can use obviously the 
sample mean. This has to be compared to the threshold value. This will be 
performed within a statistical interpretation, enabling a null hypothesis to be 
accepted or rejected, see below. For (B) one needs to compare the distribu- 
tions of the number of acquaintances before and after the change, respectively. 
Comparing two distributions can be done in many ways. One can just compare 
some moments, or define a distance between them based on the difference in 
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area between the distribution function, just to mention two possibilities. For 
discrete random variables, the mean-squared difference is particularly suitable, 
leading to the so-called chi-squared test, see Sec. 7.5.1. For the example (C), 
the task is similar to (B), only that the empirical results are compared to a 
given distribution and that the corresponding random variables are continuous. 
Here, a method based on the maximum distance between two distribution func- 
tions is used widely, called Kolmogorov-Smirnov (KS) test (see Sec. 7.5.2). To 
test hypothesis (D), which means to check for statistical independence, one can 
record a two-dimensional histogram of the population size of foxes and beetles. 
This is compared with the distribution where both populations are assumed to 
be independent, i.e. with the product of the two single-population distribution 
functions. Here, a variant of the chi-squared test is applied, see Sec. 7.5.3. In the 
case (E), the sample is not a set of just one-dimensional numbers, instead the 
simulation results are conformations of proteins given by 3 A^— dimensional vec- 
tors of the positions {i = 1, . . . , N) of N particles. Here, one could introduce 
a method to compare two protein conformations {n^}, {n^} in the following 
way: First, one "moves" the second protein towards the first one such that the 
positions of the center of masses agree. Second, one considers the axes through 
the center of masses and through the first atoms, respectively. One rotates the 
second protein around its center of mass such that these axes become parallel. 
Third, the second protein is rotated around the above axis such that the dis- 
tances between the last atoms of the two proteins are minimized. Finally, for 
these normalized positions {rf*}, one calculates the squared difference of all 
pairs of atom positions d = ^^itf — rf which serves as test function. For 
a statistical analysis, the distribution of d for one thermally fluctuating protein 
can be determined via a simulation and then compared to the average value 
observed when changing the conditions. We do not go into further details here. 

The general idea to test a null hypothesis using a test statistics in a statistical 
meaningful way is as follows: 

1. You have to know, at least to an approximate level, the probability dis- 
tribution function Ft of the test statistics under the assumption that the 
null hypothesis is true. This is the main step and will be covered in detail 
below. 

2. You select a certain significance level a. Then you calculate an inter- 
val [ai , ay] such that the cumulative probability of T outside the interval 
equals to a, for instance by distributing the weight equally outside the 
interval via F{ai) — a/2, F{au) = 1 — a/2. Sometimes one-sided intervals 
are more suitable, e.g. [oo, a„] with F(a„) = 1 — a, see below concerning 
example (A). 

3. You calculate the actual value t of the test statistics from your simulation. 
If i G [ai,a„] then you accept the hypothesis, otherwise you reject it. 
Correspondingly, the interval [ai , o„] is called acceptance interval. 

Since this is a probabilistic interpretation, there is a small probability a that 
you do not accept the null hypothesis, although it is true. This is called a type 
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/ error (also called false negative), but this error is under control, because a is 
known. 

On the other hand, it is important to realize that in general the fact that the 
value of the test statistics falls inside the acceptance interval docs not prove that 
the null hypothesis is true! A different hypothesis Hi could indeed hold, just 
your test statistics is not able to discriminate between the two hypotheses. Or, 
with a small probability /?, you might obtain some value for the test statistics 
which is unlikely for Hi, but likely for Hq. Accepting the null hypothesis, al- 
though it is not true, is called a type II error (also called false positive). Usually, 
Hi is not known, hence (3 cannot be calculated explicitly. The different cases 
and the corresponding possibilities are summarized in Fig. 7.24. To conclude: 
If you want to prove a hypothesis H (up to some confidence level 1 — a), it is 
better to use the opposite of H as null hypothesis, if this is possible. 



^\ reality 
test \^ 

decision^\ 


Hq is true 


H.| is true 


accept Hq 


correct decision 
1-a 


type II error 
P 


reject Hg 


type 1 error 
a 


correct decision 
1-p 



Figure 7.24: The null hypothesis Hq might be true, or the alternative (usually 

unknown) hypothesis Hi. The test of the null hypothesis might result in an 
acceptance or in a rejection. This leads to the four possible scenarios which 
appear with the stated probabilities. 

Indeed, in general the null hypothesis must be suitably formulated, such that 
it can be tested, i.e. such that the distribution function Ft describing T can 
be obtained, at least in principle. For example (A), since the test statistics T 
is a sample mean, it is safe; to assume a Gaussian distribution for T: One can 
perform enough simulations rather easily, such that the central limit theorem 
applies. We use as null hypothesis the opposite of the formulated hypothesis 
(A). Nevertheless, it is impossible to calculate an acceptance interval for the 
Gaussian distribution based on the assumption that the mean is larger than 
a given value. Here, one can change the null hypothesis, such that instead an 
expectation value equal to ttargct is assumed. Hence, the null hypothesis assumes 
that the test statistics has a Gaussian distribution with expectation value ttarget- 
The variance of T is unknown, but one can use, as for the calculation of error 
bars, the sample variance divided by n — 1. Now one calculates on this basis 
an interval [a/,oo] with Fxiai) = a. Therefore, one rejects the null hypothesis 
if t < a;, which happens with probability a. On the other hand, if the true 
expectation value is even larger than ftarget) then the probability of finding a 
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mean with t < aj becomes even smaller than a, i.e. less likely. Hence, the 
hypothesis (A) can be accepted or rejected on the basis of a fixed expectation 
value. 

For a general hypothesis test, to evaluate the distribution of the test statis- 
tics T, one can perform a Monte Carlo simulation. This means one draws 
repeatedly samples of size n according to a distribution Fx determined by the 
null hypothesis. Each time one calculates the test statistics t and records a 
histogram of these values (or a sample distribution function Ff) which is an 
approximation of Ft- In this way, the corresponding statistical information can 
be obtained. To save computing time, in most cases no Monte Carlo simulations 
are performed, but some knowledge is used to calculate or approximate Ft- 

In the following sections, the cases corresponding to examples (B), (C), (D) 
are discussed in detail. This means, it is explained how one can test for equality 
of discrete distributions via the chi-squared test and for equality of continuous 
distributions via the KS test. Finally, some methods for testing concerning (in- 
)dependence of data and for quantifying the degree of dependence are stated. 

7.5.1 Chi-squ£ired test 

The chi-squared test is a method to compare histograms and discrete probability 
distributions. The test works also for discretized (also called binned) continuous 
probability distributions, where the probabilities are obtained by integrating 
the pdf over the different bins. The test comes in two variants: 

• Either you want to compare the histogram {h^} for bins Bk (sec Sec. 7.3.3) 
describing the sample {a;o, x\, . . . , Xn-i} to a given discrete or discretized 
probability mass function with probabilities {pk] = P{x G Bk). The null 
hypothesis Hq is: "t/ie sample follows a distribution given by {pk}"- 

Note that the probabilities are fixed and independent of the data sample. 
If the probabilities are parametrized and the parameter is determined by 
the sample (e.g. by the mean of the data) such that the probabilities fit the 
data best, related methods as described in Sec. 7.6.2 have to be applied. 

• Alternatively, you want to compare two histograms {hk}, {hk} obtained 
from two different samples {xq^xi, Xn-i} and {xq^xi, a;„_i} 
defined for the same bins Bk. The null hypothesis Hq is: "</ie two samples 

follow the same distribution".^^ 

In case the test is used to compare intrinsically discrete data, the intervals 
Bk can conveniently be chosen such that each possible outcome corresponds to 
one interval. Note that due to the binning process, the test can be applied to 
high-dimensional data as well, where the sample is a set of vectors. Also non- 
numerical data can be binned. In these cases each bin represents either a subset 

^^Note that here we assume that the two samples have the same size, which is usually easy 
to achieve in simulations. A different case occurs when also the number of sample points is a 
random variable, hence a difference in the number of sample points makes the acceptance of 
Ho less likely, see [Press et al. (1995)]. 
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of the high-dimensional space or, in general, a subset of the possible outcomes. 
For simplicity, we restrict ourselves here to one-dimensional numerical samples. 



Figure 7.25: Chi-squared statistics: A histogram (solid line) is compared to 
a discrete probability distribution (dashed line). For each bin, the sum of the 
squared differences of the bin counter hk to the expected number of counts npk is 
calculated (dotted vertical lines), see Eq. (7.68). In this case, the differences are 
quite notable, thus the probability that the histogram was obtained via random 
experiments from a random variable described by the probabilities {pk} (null 
hypothesis) will be quite small. 

We start with the first case, where a sample histogram is compared to a 
probability distribution, corresponding to example (C) on page 57. The test 
statistics, called x^; is defined as: 



with npk being the expected number of sample points in bin Bk- The prime at 
the sum symbol indicates that bins with hk = npk — are omitted. The number 
of contributing bins is denoted by K' . If the pmf pk is nonzero for an infinite 
number of bins, the sum is truncated for terms npk <C 1. This means that the 
number of contributing bins will be always finite. Note that bins exhibiting 
hk > but Pk — are not omitted. This results in an infinite value of x^, which 
is reasonable, because for data with hk > but pk = 0, the data cannot be 
described by the probabilities pk ■ 

The chi-squared distribution with i' = K' — 1 degrees of freedom (see Eq. 
(7.45)) describes the chi-squared test statistics, if the number of bins and the 
number of bin entries is large. The term —1 in the number of degrees of freedom 
comes from the fact that the total number of data points n is equal to the total 
number of expected data points nkPk — n Pk — hence the K' different 
summands are not statistically independent. The probability density of the chi- 
squared distribution is given in Eq. (7.45). To perform the actual test, it is 



h.,np. 





(7.68) 
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recommended to use the implementation in the GNU scientific library (GSL) 
(see Sec. 6.3). 

Next, a C function chi2_hd() is shown which | GET SOURCE CODE 

calculates the ciimulativc probability (p-value) DIR- randomness 

that a value of or larger is obtained, given FILE(S): chi2.c 
the null hypothesis that the 

sample was generated using the probabilities pk- Arguments of chi2_hd() are 
the number of bins, and two arrays h [] and p [] containing the histogram hk 
and the probabilities pk, respectively: 



double chi2_hd(int n_bins, int *h, double *p) 
{ 

int n; /* total number of sample points */ 

double chi2; /* chi*2 value */ 

int K_priine; /* number of contributing bins */ 

int i ; /* counter */ 



n = 0; 

for(i=0; i<n_bins; i++) 

n += li[i] ; /* calculate total number of sample_points */ 



clii2 = 0.0; K_prime = 0; 

for(i=0; i<n_biiis; i++) /* calculate chi"2 */ 

{ 

if(p[i] > 0) 
{ 

chi2 += (h[i]-n*p[i])*(h[i]/(n*p[i])-1.0) ; 
K_prime ++; 

} 

else if(h[i] >0) /* bin entry for zero probability ? */ 

{ 

chi2 = le60; 
K_prime ++; 

} 

} 

retiirn(gsl_cdf _chisq_Q(chi2, K_prime-1)) ; 



First, in lines 8-10, the total number of sample points is obtained from summing 
up all histogram entries. In the main loop, lines 12-25, the value of is 
calculated. In parallel, the number of contributing bins is determined. Finally 
(line 26) the p-value is obtained using the GSL function gsl_cdf_chisq_Q(). 
This p-value can be compared with the significance level a. If the the p-value 
is larger, the null hypothesis is accepted, otherwise rejected. 

Note that the result for the p-value clearly depends on the number of bins, 
and, if applicable, on the actual choice of bins. Nevertheless, all reasonable 
choices, although maybe leading to somehow different numerical results, will 
lead to the same decisions concerning the null hypothesis in most cases. 
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Next, wc consider the case, where we want to compare two histograms 
{hk}, {hk} corresponding to example (B) on page 57. In this case the statis- 
tics reads 



= I^^i^ (7.69) 
^ hk + hk 



The sum runs over all bins where hk or hk ^ 0, and K' being the c;or- 
responding number of contributing bins. Consequently, the bins which should 
be included are uniquely defined, in contrast to the case where a histogram is 
compared to a distribution defined for infinitely many outcomes. Note that in 
the denominator the sum of the bin entries occurs, not the average. The reason 
is that the chi-squared distribution is a sum of standard Gaussian distributed 
numbers (variance 1) and here, where the differences of two (approximately) 
Gaussian quantities are taken, the resulting variance is the sum of the individ- 
ual variances, approximated roughly by the histogram entries. To calculate the 
p-value, again the chi-squared distribution with v = K' — 1 degrees of freedom 
is to be applied. Here, no C implementation is shown, rather we refer the reader 
to exercise (6). 

7.5.2 Kolmogorov-Smirnov test 

Next, we consider the case where the statistical properties of a sample {xo,xi, 
. . . , .x„_i}, obtained from a repeated experiment using a continuous random 
variable, is to be compared to a given distribution function Fx- One could, 
in principle, compare a histogram and a correspondingly binned probability 
distribution using the chi-squared test explained in the previous section. Unfor- 
tunately, the binning is artificial and has an influence on the results (imagine few 
very large bins). Consequently, the method presented in this section is usually 
preferred, since it requires no binning. Note that if the distribution function 
is parametrized and if the parameter is determined by the sample (e.g. by the 
mean of the data) such that the Fx fits the data best, the methods from Sec. 
7.6.2 have to be applied. 

The basic idea of the Kolmogorov-Smirnov test is to compare the distribution 
function to the empirical sample distribution function F-^ defined in Eq. (7.65). 
Note that Fj^(x) is piecewise constant with jumps of size 1/n at the positions 
Xi (assuming that each data point is contained uniquely in the sample). 

Here again, one has several choices for the test statistics. For instance, one 
could calculate the area between Fx and Fj^. Instead, usually just the max;imum 
difference between the two functions is used: 

drm^ = raax\Fx{x) - Fj^{x)\ (7.70) 



Since the sample distribution function changes only at the sample points, 
one has to perform the comparison just before and just after the jumps. Thus, 
Eq. (7.70) is equivalent to 



64 



A.K. Hartmann: Practical Guide to Computer Simulations 



dmax = niax{|i^_x(a;i) - 1/n - Fj^{xi) \ , |Fx(xi) - F^{xi)\} 
This sample statistics is visualized in Fig. 7.26. 




Figure 7.26: Kolmogorov-Smirnov test: A sample distribution function (solid 
line) is compared to a given probability distribution function (dashed line) . The 
sample statistics dmax is the maximum difference between the two functions. 

The p-value, i.e. the probability of a value of dmax as measured (dmax*^""^^) 
or worse, given the null hypothesis that the sample is drawn from Fx, is ap- 
proximately given by (see [Press et al. (1995)] and references therein): 



P{dn 



> d 



measured \ 



Qks {[V^+0.12 + 0.11/V^]d] 



measured\ 
max / 



(7.71) 



This approximation is already quite good for n > 8. Here, the following auxiliary 
probability function is used: 



QKs(A) = 2^(-l)*+ie 



(7.72) 



GET SOURCE CODE 



DIR: randomness 
FILE(S): ks.c 



with (5ks(0) — 1 and (3ks(oo) = 0. This function 
can be implemented most easily by a direct sum- 
mation [Press et al. (1995)]. The function Q_ks() 
receives the value of A as argument and returns 
Qks(A): 



double Q_ks (double lambda) 
{ 

const double epsl = 0.0001; /* relative margin for stop */ 

const double eps2 = le-10; /* relative margin for stop */ 

int i; /* loop counter */ 

double sum; /* final value */ 

double factor; /* constant factor in exponent */ 
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double sign; 

double term, last_term; 



/* of summEmd */ 
/ * summands , last summand */ 



sum = 0.0; last .term = 0.0; sign = 1.0; 
factor = -2.0*lambda*lambda; 
for(i=l; i<100; i++) 
{ 



/* initialize */ 



/* sum up */ 



term = sign*exp(f actor*i*i) ; 
sum += term; 

if ( (fabs(term) <= epsi*fabs(last_term)) | | 
(fabs(term) <= eps2*sum)) 
return(2*sum) ; 
sign =- sign; 
last_term = term; 



The summation (lines 13-22) is performed for at most 100 iterations. If the 
current term is small compared to the previous one or very small compared to 
the sum obtained so far, the summation is stopped (line 17-18). If this does 
not happen within 100 iterations, the sum has not converged (which means A 
is very small) and Q{0) = 1 is returned. 

This leads to the following C implementation for the KS test. The function 
ks() expects as arguments the number of sample points n, the sample x[] and 
a pointer F to the distribution function: 

double ks(int n, double *x, double (*F) (double) ) 
{ 

double d, d_max; /* (maximum) distance */ 

int i; /* loop counter */ 

double F_X; /* empirical distribution function */ 

qsort(x, n, sizeof (double) , compare_double) ; 

F_X = 0; d_max =0.0; 

for(i=0; i<n; i++) /* scan through F_X */ 

■C 

d = fabs(F_X-F(x[i])) ; /* distance before jump of F_X */ 
if( d> d_max) 

d_max = d; 
F_X += 1.0/n; 

d = fabs(F_X-F(x[i] )) ; /* distance after jump of F_X */ 
if ( d> d_max) 
d_max = d; 

} 

return(Q_ks( d_max*(sqrt (n)+0 . 12+0. 11/sqrt (n) )) ) ; 



} 

return (1.0) ; 



/* in case of no convergence */ 



} 



First the sample is sorted (line 7). This allows for a simple implementation 
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of the sample distribution function, because at each sample data point, in the 
order of occurrence, the value of is increased by 1/n. When obtaining the 
maximum distance (lines 10-19), one has to compare Fj^ to the distribution 
function Fx .just before (lines 12 14) and after (lines 15-18) the jumps. Note 
that this implementation works also for samples, where some data points occur 
multiple times. 

For the actual test, one calculates the p- value for the given sample using 
ks(). If the p- value exceeds the indented significance level a, the null hy- 
pothesis is accepted, i.e. the data is compatible with the distribution with high 
probability. Usually quite small significances are used, e.g. a = 0.05. This 
means that even substantial values of dmax are accepted. Thus, one rejects the 
null hypothesis only, as usual, in case the probability for an error of type I is 
quite small. 

It is also possible to compare two samples of sizes rii, 712 via the KS test. The 
test statistics for the two sample distribution functions is again the maximum 
distance. The probability to find a value of dniax as obtained or worse, given 
the null hypothesis that the samples are drawn from the same distribution, is 
as above in Eq. (7.71), only one has to replace n by the "effective" sample 
size TT-off = nin2/{ni + 712), for details see [Press et al. (1995)] and references 
therein. It is straightforward to implement this test when using the C function 
ks() shown above as template. 



7.5.3 Statistical (in-) dependence 

Here, we consider samples, which consist of pairs (xj, yi) (i — 0, 1, . . . , n — 1) of 
data points. Generalizations to higher-dimensional data is straightforward. The 
question is, whether the yi values depend on the Xi values (or vice versa). In this 
case, one also says that they are statistically related. If yes, this means that if we 
know one of the two values, we can predict the other one with higher accuracy. 
The formal definition of statistical (in-) dependence was given in Sec. 7.1. An 
example of statistical dependence occurs in weather simiilations: The amount 
of snowfall is statistically related to the temperature: If it is too warm or too 
cold, it will not snow. This also shows, that the dependence of two variables it 
not necessarily monotonous. In case one is interested in monotonous and even 
linear dependence, one usually says that the variables are correlated, see below. 

It is important to realize that we have to dis- 
tinguish between statistical significance of a sta- 
tistical dependence and the strength of the depen- 
dence. Say that our test tells us that the x val- 
ues are statistically related with high probability. 
This usually just means that we have a large sam- 
ple. On the other hand, the strength of the sta- 
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tistical dependence can be still small. It could be, fo'r eaxample, that a giveii 
value for x will influence the probability distribution for y only slightly. One 
the other hand, the strength can be large, which means, for example, knowing 
X almost determines y. But if we have only few samples points, we cannot be 
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Figure 7.27: Scatter plots for n data points (xi,yi) wliere tlie Xi numbers are 
generated from a standard Gaussian distribution (expectation value 0, variance 
1), while each jji number is drawn from a Gaussian distribution with expectation 
value KXi (variance 1). 



very sure whether the data points are related or not. Nevertheless, there is some 
connection: the larger the strength, the easier it is to show that the dependence 
is significant. For illustration consider a sample where the x, numbers are gen- 
erated from a standard Gaussian distribution (expectation value 0, variance 1), 
while each yi number is drawn from a Gaussian distribution with expectation 
value KXi (variance 1).^^ Hence, if k = 0, the data points are independent. 
Scatter plots, where each sample point {xi, yi) is shown as dot in the x — y plane 



This is an example, where the random variables Yi which described the sample are not 
identical. 
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are exposed in Fig. 7.27. Four possibilities are presented, k = 0/1 combined 
with n = 50/5000. Below, we will also present what the methods we use here 
will tell us about these data sets. 

In this section, first a variant of the chi-squared test is presented, which 
enables us to check whether data is independent. Next, the linear correlation 
coefficient is given, which states the strength of linear correlation. Finally, it is 
discussed how one can quantify the dependence within a sample, for example 
between sample points Xi,Xi + r. 

To test statistical dependence for a sample {(a;o, yo), {x\,y\), . . . , (a;„-i, t/n-i)}, 
one considers usually the null hypothesis: Hq = "The x sample points and the y 
sample points are independent." To test Hq one puts the pairs of sample points 
into two-dimensional histograms {hki}- The counter hki receives a count, if for 
data point {xi,yi) we have Xi e b\^'' and j/i G B'f\ for suitably determined 
bins {B^j^^} and {B^^^}. Let kx and ky the number of bins in x and y direction, 
respectively. Next, one calculates single- value (or one-dimensional) histograms 
{U^'^] and defined by 

I 

= Y.^ki (7.73) 

k 

These one-dimensional histograms describe how many counts in a certain bin 
arise for one variable, regardless of the value of the other variable. It is assumed 
that all entries of these histograms are not empty. If not, the bins should be 
adjusted accordingly. Note that n = li^^^'' = lif ^ = YIm ^ki holds. 

Relative frequencies, which are estimates of probabilities, are obtained by 
normalizing with n, i.e. h^^^/n and hl^^/n. If the two variables Xi,yi are in- 
dependent, then the relative frequency to obtain a pair of values {x,y) in bins 
{Bj^^} and {B^^^} should be the product of the single- value relative frequencies. 
Consequently, by multiplying with n one obtains the corresponding expected 
number nki of counts, under the assumption that Hq holds: 

nfc,=n^^ = i^^ (7.74) 
n n n 

These expected numbers arc compared to the actual numbers in the two- 
dimensional histogram {hki} via the test statistics, comparable to Eq. (7.68): 

^2^j2 ^hki-nkir (775) 

kl 

The statistical interpretation of is again provided by the chi-squared dis- 
tribution. The number of degrees of freedom is determined by the number of 
bins {kxky) in the two-dimensional histogram minus the number of constraints. 
The constraints are given by Eq. (7.73), except that the total number of counts 
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being n is contained twice, resulting in kx + ky 
of degrees of freedom is 



1. Consequently, the number 



^ — kxky kx ky 1 . 



(7.76) 



GET SOURCE CODE 
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Therefore, under the assumption that the x and y sample points are indepen- 
dent, p = 1—F{x^, v) gives the probability (p- value) of observing a test statistics 
of or larger. F is here the distribution function of the chi-square distribution, 
see Eq. (7.45). This p- value has to be compared to the significance level a. If 
p < a, the null hypothesis is rejected. 

The following C function implements the chi- 
squared independence test chi2_indep() . It re- 
ceives the number of bins in x and y direction 
as arguments, as well as a two-dimensional array, 
which carries the histogram: 

double chi2_indep(int n_x, int n_y, int **h) 
{ 

int n; /* total number of sample points */ 

double chi2; /* chi"2 value */ 

int k_x, k_y; /* number of contributing bins */ 

int k, 1; /* counters */ 

int *hx, *hy; /* one-dimensional histograms */ 



hx = (int *) malloc (n_x*sizeof (int) ) ; /* allocate */ 

hy = (int *) malloc (n_y*sizeof (int) ) ; 

n = 0; /* calculate total niunber of sample_points */ 

for(k=0; k<n_x; k++) 
for (1=0; l<n_y; 1++) 
n += h[k] [1] ; 



k_x =0; /* calculate 1-dim histogram for x */ 

for(k=0; k<n_x; k++) 
{ 

hx[k] = 0; 

for(l=0; l<n_y; 1++) 

hx[k] += h[k] [1] ; 
if(hx[k] > 0) /* does x bin contribute ? */ 

k_x++ ; 

} 
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k_y =0; /* calculate 1-dim histogram for y */ 

for (1=0; l<n_y; 1++) 

-c 

hy[l] = 0; 

for(k=0; k<n_x; k++) 

hy[l] += h[k] [1] ; 
if(hy[l] > 0) /* does y bin contribute ? */ 

k_y++; 

} 

chi2 = 0.0; 

for(k=0; k<n_x; k++) /* calculate chi"2 */ 

for(l=0; l<n_y; 1++) 

if( (hx[k] != 0)&&(hy[l] != 0) ) 

chi2 += pow(h[k] [1] -(double) hx[k]*hy[l]/n, 2.0)/ 
( (double) hx [k] *hy [1] /n) ; 

free(hx) ; 
f ree(hy) ; 

return(gsl_cdf _chisq_C!(clii2, k_x*k_y - k_x -k_y + 1)); 

} 

First, the one-dimensional histograms are allocated (lines 9-10). Then the total 

mimbcr of coimts, i.e. the sample size, is calculated (lines 12 15). In lines 17 26, 
the one-dimensional histogram for the x direction is obtained. Also the effective 
number of bins in that direction is calculated. In lines 27-35, the same happens 
for the y direction. The actual value of the test statistics is determined in 
lines 37-42. Finally, the allocated memory is freed (lines 44-45) and the p- value 
calculated (line 46), again the GSL function gsl_cdf _chisq_Q() is used. 

The p-vahies for the sample sets shown in Fig. 7.27 are as follows; p{k = 
0,n = 50) = 0.077, p{k = 0,n = 5000) = 0.457, p{k = l,n = 50) = 0.140, 
p{k = 1, n = 5000) < 10~^°°. Hence, the null hypothesis of independence would 
not be rejected (say a = 0.05) for the case k = l,n = 50, which is actually 
correlated. On the other hand, if the number of samples is large enough, there 
is no doubt. 

Once it is established that a sample contains dependent data, one can try 
to measure the strength of dependence. A standard way is to use the linear 
correlation coefficient (also called Pearson's r) given by 

T,i{xi-x){yi-y) ^^^^^ 



This coefficient assumes, as indicated by the name, that a linear correlation 
exists within the data. The implementation using a C function is straight for- 
ward, see exercise (7). For the data shown in Fig. 7.27, the following correlation 
coefficients are obtained: r(K = 0, n = 50) = 0.009, r(K = 0, n = 5000) = 0.009, 
r{K, = l,n = 50) = 0.653, r(«; = l,n = 5000) = 0.701. Here, also in the two 
cases, where the statistics is low, the value of r reflects whether or not the data 



7.5. HYPOTHESIS TESTING AND (IN-)DEPENDENCE OF DATA 



71 



is correlated. Nevertheless, this is only the case because we compare strongly 
correlated data to uncorrelated data. If we compare weakly but significantly cor- 
related data, we will still get a small value of r. Hence, to test for significance, 
it is better to use the hypothesis test based on the test statistics. 

Finally, note that a different type of correlation may arise: So far it was 
always assumed that the different sample points Xi,Xj (or sample vectors) 
are statistically independent of each other. Nevertheless, it could be the 
case, for instance, that the sample is generated using a Markov chain Monte 
Carlo simulation [Newman and Barkema (1999), Landau and Binder (2000), 
Robert and Casella (2004), Liu (2008)], where each data point .t^+i is calcu- 
lated using some random process, but also depends on the previous data point 
Xi, hence i is a kind of artificial sample time of the simulation. This dependence 
decreases with growing time distance between sample points. One way to see 
how quickly this dependence decreases is to use a variation of the correlation 
coefficient Eq. (7.77), i.e. a correlation function: 

^ n—l—T 
C{t) = XiXi+r 

n — T f-^ 

i=0 

The term Yl'iZo ^ ^"=0 ^^^^^ converge to x"^ for n ^ oo 

if it can be assumed that the distribution of the sample points is stationary, 
i.e. does not depend on the sample time. Therefore, C{t) is approximately 
^^"3^ Y^i'^o~'^ (xi — x){xi^r — x), comparable to the nominator of the linear cor- 
relation coefficient Eq. (7.77). Usually one normalizes the correlation function 
by (5(0), which is just the sample variance in the stationary case, see Eq. (7.51): 

C{t)=C{t)/C{0). (7.79) 

Consequently, for any data, for example obtained from a Markov chain 
Monte Carlo simulation, C(0) = 1 will always hold. Then C(r) decreases with 
increasing difference r, see for example Fig. 7.28. Very often the fimctional form 
is similar to an exponential ^ exp(— t/tc). In theory, C(t) should converge to 
zero for r — ^ oo, but due to the finite size of the sample, usually strong fluctu- 
ations appear for t approaching n. A typical time which measures how fast 
the dependence of the sample points decreases is given by C(tc) = 1/e, which 
is consistent with the above expression, if the correlation function decreases 
exponentially. At twice this distance, the correlation is already substantially 
decreases (to 1/e^). Consequently, if you want to obtain error bars for sam- 
ples obtained from dependent data, you could include for instance only points 
xo,X2Tc,X4T^,xeTc, ... in a sample, or just use n/(2Tc) instead of n in any cal- 
culation of error bars. Although these error bars are different from those if the 
sample was really independent, it gives a fairly good impression of the statistical 
error. 
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Figure 7.28: Correlation function C(r) for a simulation of a ferromagnetic sys- 
tem, Xi being the magnetization at time step i. (For experts: Ising system of 
size 16 X 16 spins simulated with single-spin flip Metropolis Monte Carlo at a 
(reduced) temperature T = 2.269 close to the phase transition temperature, 
where correlation times Tc are large). 

Alternatively, to obtain a typical time Tc without calculating a correlation 

function, you can also use the blocking method [Flyvbjerg (1998)]. Within this 
approach, you iteratively merge neighboring data points via x'f'^^'^ = {x^2i + 
^2i+i)/2 and n^^~^^^ = n^^^ jl (iteration level 2; = corresponds to the original 
sample). You calculate the standard error bar cr^^V'V^'^'^' ~ 1 for each iteration 
level. Once it reaches a plateau at level Zc, the data is (almost) independent 
and the true error bar is given by the level value. Then Tc = 2^"= is a typical 
time of independence of the data points. 

If you arc really just interested in error bars, i.e. you do not need to know the 
value of Tc, you could also use the bootstrap approach which is not susceptible 
to dependence of data, see Sec. 7.3.4. 

7.6 General estimators 

In Sec. 7.3, different methods are presented of how to estimate parameters which 
can be obtained directly and simply from the given sample {a;o,Xi, . . . , a;„_i}. 
In this section, a general method is considered which enables estimators to be 
obtained for arbitrary parameters of probability distributions. The method is 
based on the maximum-likelihood principle, which is exposed in Sec. 7.6.1. This 
principle can be extended to the modeling of data, where often a sample of 
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triplets {(a;o,?;o,a'o), (a;i, j/i, (Ti), (a;„-i, (t„-i)} is given. Typically 

the Xi data points represent some control parameter, which can be chosen in 
the simulation, such as the temperature of a gas. It is assumed that all Xi 
values arc different. Consequently, the simulation has been carried out at n 
different values of the control parameter. The yi data points are averages of 
measurements (e.g. the density of the gas) obtained in the simulations for the 
fixed value Xi of the control parameter. The ai values are the corresponding error 
bars.^^ Modeling the data means that one wants to determine a relationship 
y = y{x). Usually some assumptions or knowledge about the relationship are 
available, which means one has available one parametrized test hmction ye{x). 
Consequently, the set of parameters has to be adjusted 9 such that the function 
ye_{x) fits the sample "best" . This is called data fitting and will be explained 
in Sec. 7.6.2. This approach can also be used to compare several fitted test 
functions to determined which represents the most suitable model. 

7.6.1 Mctximum likelihood 

Here, we consider the following task: For a given sample {xo,xi, . . . , Xn^i} 
and a probability distribution represented by a pmf P0{x) or a pdf fe_{x), we 
want to determine the parameters 0_ = {6i, . . . , 6n^ ) such that the pmf or pdf 
represents the data "best" . This is written in parentheses, because there is not 
unique definition what "best" means, or even a mathematical way to derive a 
suitable criterion. If one assumes no prior knowledge about the parameters, one 
can use the following principle: 

Definition The maximum-likelihood principle states that the parameters 0_ 
should be chosen such that the likelihood of the data set, given the parameters, 
is maximal. 

In case of a discrete random variable, if it can be assumed that the different 
data points are independent, the likelihood of the data is just given by the 
product of the single data point probabilities. This defines the likelihood function 

L{0) = pe{xi)pe{x2) ■ ..pi{xn-i) = [| Piixi) (7.80) 

i=0 

For the continuous case, the probability to obtain during a random exper- 
iment exactly a certain sample is zero. Nevertheless, for a small uncertainty 
parameter e, the probability to obtain a value in the interval [x — e, .x + e] is 
P{x — e<X<x + e) = J~^^ fe{x) dx ~ fs_{x)2e. Since 2e enters just as a 
factor, it is not relevant to determining the maximum. Consequently, for the 
continuous case, one considers the following likelihood function 

m) = fi{xi)fe_{x2) . . . f0_{xn-i) = n (7.81) 

i=0 

Sometimes also the Xi data points are measured quantities which are also characterized 
by error bars. The generaUzation of the methods to this case is straightforward. 
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To find the maximum of a likclitiood function L{0) analytically, one has to 
calculate the first derivatives with respect to all parameters, respectively, and 
requires them to be zero. Since calculating the derivative of a product involves 
the application of the product rule, it is usually more convenient to consider the 
log-likelihood function 

l{e) = \ogL{e). (7.82) 

This turns the product of single-data-points pmfs or pdfs into a sum, where 
the derivatives arc easier to obtain. Furthermore, since the logarithm is a 
monotonous function, the maximum of the likelihood function is the same as 
the maximum of the log-likelihood function. Hence, the parameters which suit 
"best" arc determined within the maximum-likelihood approach by the set of 
equations 

-^=0 (fc = l,...,np) (7.83) 

Note that the fact that the first derivatives arc zero only assures that an extremal 
point is obtained. Furthermore, these equations often have several solutions. 
Therefore, one has to check explicitly which solutions are indeed maxima, and 
which is the largest one. Note that maximum-likelihood estimators, since they 
are functions of the samples, are also random variables MLe^^„(Xo, . . . , X„_i). 

As a toy example, we consider the exponential distribution with the pdf 
given by Eq. (7.39). It has one parameter ji. The log-likelihood function for a 
sample {xq, Xi, . . . , Xn-i] is in this case 

n-l 



= log Yl Mxi) 



i=0 
n—1 / 

n log < — 
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Taking the derivative with respect to we obtain: 



I dL(e) -1 -n_ -n, 

= = ^ — l^ 2^ = — ( 



This implies ji = x. It is easy to verify that this corresponds to a maximum. 
Since the expectation value for the exponential distribution is just E[X] = ji, 
this is compatible with the result from Sec. 7.3, where it was shown that the 
sample mean is an unbiased estimator of the expectation value. 

If one applies the maximum-likelihood principle to a Gaussian distribu- 
tion with parameters ^, and cr^, one obtains (not shown here, see for example 
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[Dekking ct al (2005)]) as maximum- likelihood estimators the sample mean x 
(for /x) and the sample variance (for a^), respectively. This means (see Eq. 
(7.54)) that the maximum-likelihood estimator for is biased. Fortunately, 
wc know that the bias disappears asymptotically for n ^ oo. Indeed, it can be 
shown, under rather mild conditions on the underlying distributions, that all 
maximum-likelihood estimators ML0^^„(Xo, . . . ,X„_i) for a parameter 6k are 
asymptotically unbiased, i.e. 

lim E[MLe,,„] = ek (7.84) 

n — *oo 

In contrast to the exponential and Gaussian cases, for many applications 
the maximum-likelihood parameter is not directly related to a standard sample 
estimator. Furthermore, MLgj, „ can often even not be determined analytically. 
In this case, one has to optimize the log-likelihood function numerically, for ex- 
ample, using the corresponding methods from the GNU scientific library (GSL) 
(sec Sec. 6.3). 

As example, wc consider Fisher-Tippett distri- 
bution, see Eq. (7.43), shifted to exhibit the max- 
imum at xo instead of at 0. Hence, we have two 
parameters A and xq to adjust. The function to 
be optimized (the target function) , i.e. the log- likelihood function here, must be 
of a special format when using the minimization functions of the GSL. This first 
argument of the target function contains the pdf parameters to be adjusted, i.e. 
the main argument vector of the target function. This argument must be of 
the type gsl_vector, which is a GSL type for vectors. One needs to include 
<gsl/gsl_vector .h> to use this data type. These vectors are created using 
gsl_vector_alloc , set elements via gsl_vector_set(), access elements via 
gsl_vector_get() and delete the vectors via gsl_vector_free(). The usage 
of these functions should be self-explanatory from the examples below, but you 
may also have a look at the GSL documentation [Galassi et al. (2006)]. 

The second argument of the target function contains one pointer to all addi- 
tional data needed to calculate the target function, i.e. the sample in this case. 
Thus, the sample must be stored in one chunk of memory. For this purpose, we 
use the following structure type: 

typedef struct 
{ 

int n; /* nimber of sample points; */ 

double *x; /* sample */ 

} 

sample_t ; 

Since the GSL package contains actually minimization functions, while we 

are interested in a maximum, the actual log-likelihood function returns minus 
the log-likelihood. The log-likelihood function reads as follows: 
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double ll_ft (const gsl_vector *par, void *peirain) 
{ 

double lambda, xO; /* parameters of pdf */ 

sample_t *sample; /* sample */ 

double sum; /* sum of log-likelihood contributions */ 

int i; /* loop counter */ 



lambda = gsl_vector_get (par , 0) ; 
xO = gsl_vector_get(peir , 1); 
sample = (sample_t ♦) param; 



/* get data */ 



sum = sample->n*log (lambda) ; /* calculate log likelihood */ 
for(i=0; i<sample->n; i++) 

sum -= lambda* (sample->x[i]-xO) + 

exp(-lambda*(sample->x[i] -xO)) ; 

return (-sum) ; /* return - log likelihood */ 



First, we convert the pointers passed as arguments to the data format that 
we find useful (lines 8-10). Next, the actual log likelihood 

n—l n—1 

l{X,Xo) = nlogA - X'^{xi - Xo) - ^exp(-A(a;i - xq)) 

i=0 i=0 

is calculated in lines 12-15 and finally returned with inverted sign (line 17). 

The GSL has built in several minimization algorithms. They are all put 
under one of two frameworks. One framework is for algorithms which require the 
target function and its first derivatives. The other framwork contains algorithms 
where just the target function is sufficient. Here we use the simplex algorithm, 
which belongs to the latter form. It works by spanning a simplex, "'^^ evaluating 
the target functions at the corners of the simplex, and iteratively changing the 
simplex until it is very small and contains the solution. Note that the algorithm 
is only able to find local minima, and only one of them. If several minima exist, 
the choice of the initial parameters strongly influence the final results; Here, 
one maybe has to try several parameters. For details see [Galassi et al. (2006)] . 
Here wc only show how to use the minimizer. The minimizer itself is stored in a 
special data structure of type gsl_multimin_f minimizer. The target function 
has to be put into a "surrounding" variable of type gsl_multimin_function. 
Furthermore, one needs two gsl_vector variables to store the current estimate 
for the optimum (specifying the position of the simplex) and to store the size of 
the simplex. Also, par is used here to state the dimension of the target function 
argument (2) and Scunple to store the sample. 

These variables are declared as follows: 



A simplex is a convex set in an n-dimensional space generated by n + 1 corner points. 
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int num_par; /* number of parameters */ 

sample_t sample; /* sample */ 

gsl_multimiii_fminimizer *s; /* the full mimimizer */ 

gsl_vector *simplex_size; /* (relative) simplex size */ 

gsl_vector *par; /* params to be optimized = args of target */ 
gsl_iimltimin_f unction f; /* holds function to be optimized */ 

The actual allocation and initialization of these variables may look as follows: 

sample. n = 10000; /* initilization */ 

sample. X = (double *) malloc(sample.n*sizeof (double)) ; 
num.par = 2; 

f .f = &ll_ft; /* initialize minimization */ 

f . n = num_par ; 

f .params = fesample; 

simplex_size = gsl_vector_alloc(nvim_peir) ; /* alloc simplex */ 
gsl_vector_set_all(simplex_size, 1.0); /* init simplex */ 

par = gsl_vector_alloc (num_par) ; /* alloc + init arguments */ 
gsl_vector_set(par , 0, 1.0); 
gsl_vector_set(peir , 1, 1.0); 
s = 

gsl_multimin_fminimizer_alloc (gsl_multimin_fminimizer_nms implex, 

num_par) ; 

gsl_multimin_fminimizer_set (s , &f , par, simplex_size) ; 

The set-up of the minimizer object comes in two steps, first al- 
location using gsl_inultiinin_fininimizer_alloc () , then initialization via 
gsl_multimin_f minimizer_set while passing the target function, the start- 
ing point par and the (initial) simplex size.^^ The sample. x[] array has to be 
filled with the actual sample (not shown here). 

The minimization loop looks as follows: 

do /* perform minimization */ 

{ 

iter++ ; 

status = gsl_multimin_fminimizer_iterate(s) ; /* one step */ 
if (status) /* error ? */ 

break; 

size = gsl_multimin_fminimizer_size(s) ; /* converged ? */ 
status = gsl_multimin_test_size(size, le-4) ; 

} 

while( (status == GSL.CONTINUE) && (iter<i00) ); 



The main work is done in gsl_multimin_f minimizer_iterate () . Then 
it is checked whether an error has occurred. Next, the size of the simplex is 
calculated and finally tested whether the size falls below some limit, 10~* here. 

^''The simplex is spanned by par and the n vectors given by par plus (0, . . . , 0, 
simplex.size [i] , 0, . . . , 0) for i = 1, . . . , n. 
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The actual estimate of the parameters can be obtained via 
gsl_vector_get(s->x, 0) and gsl_vector_get (s->x, 1). Note that 
finally all allocated memory should be freed: 

gsl_vector_free(par) ; /* free everything ♦/ 

gsl_vector_free(simplex_size) ; 
gsl_multimin_f miniinizer_f ree (s) ; 
free (sample . x) ; 



As an example, n = 10000 data points were generated according to a Fisher- 
Tippett distribution with parameters A = 3.0, a;o = 2.0. With the above starting 
parameters, the minimization converged to the values A = 2.995 and Xo = 2.003 
after 39 iterations. 

7.6.2 Data fitting 

In the previous section, the parameters of a probability distribution are chosen 
such that the distribution describes the data best. Here, we consider a more gen- 
eral case, called modeling of data. As explained above, here a sample of triplets 
{{xo,yo,(7o), {xi,yi,ai), (a;„_i, (7„_i)} is given. Typically, the are 
measured values obtained from a simulation with some control parameter (e.g. 
the temperature) fixed at different values Xi] cTi is the corresponding error bar 
of Ui. Here, one wants to determine parameters 9— {9i, . . . , On^) such that the 
given parametrized function yeix) fits the data "best", one says one wants to 
jit the function to the data. Similar to the case of fitting a pmf or a pdf, there 
is no general principle of what "best" means. 

Let us assume that the are random variables, i.e. comparing different sim- 
ulations. Thus, the measured values are scattered around their "true" values 
ye{xi). This scattering can be described approximately by a Gaussian distribu- 
tion with mean ye(xi) and variance erf: 

qe_{yi) ~ exp |^ ^ J . (7.85) 

This assumption is often valid, e.g. when each sample point yi is itself a sample 
mean obtained from a simulation performed at control parameter value Xi, and 
CTi is the corresponding error bar. The log-likelihood function for the full data 
sample is 



n-l 

m = ^og\{qe_{yi) 
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Maximizing l{0_) is equivalent to minimizing —2l{0), hence one minimizes the 

mean-squared difference 



This moans the parameters 9 are determined such that function yg (x) follows the 
data points {{xq, yo), . . . (a;„_i, y„_i)} as close as possible, where the deviations 
are measured in terms of the error bars cTj. Hence, data points with smaller error 
bar enter with more weight. The full procedure is called least-squares fitting. 

The minimized mean-squared difference is a random variable. Note that the 
different terms are not statistically independent, since they are related by the 
rip parameters 9_ which are determined via minimizing x^. As a consequence, 
the distribution of x| is approximately given by chi-squared distribution (see 
Eq. (7.45) for the pdf) with n — Up degrees of freedom. This distribution can 
be used to evaluate the statistical significance of a least-squares fit, see below. 

In case, one wants to model the underlying distribution function for a sample 
as in Sec. 7.6.1, say for a continuous distribution, it is possible in principle to use 
the least-squares approach as well. In this case one would fit the parametrized 
pdf to a histogram pdf, which has also the above mentioned sample format 
{{xi,yi,ai)}. Nevertheless, although the least-squares principle is derived using 
the maximum-likelihood principle, usually different parameters are obtained if 
one fits a pdf to a histogram pdf compared to obtaining these parameters from 
a direct maximum-likelihood approach. Often [Bauke (2007)], the maximum- 
likelihood method gives more accurate results. Therefore, one should use a 
least-squares fit mainly for a fit of a non-pmf/non-pdf fimction to a data set. 

Fortunately, to actually perform least-squares fitting, you do not have to 
write your own fitting functions, because there are very good fitting implemen- 
tations readily available. Both programs presented in Sec. 7.4, gnuplot and 
xmgrace, offer fitting to arbitrary functions. It is advisable to use gnuplot, since 
it offers higher flexibility for that purpose and gives you more information useful 
to estimate the quality of a fit. 

As an example, let us suppose that you want to fit an algebraic function of 
the form f{L) = Soo +o,L^ to the data set of the file sg_eO_L.dat shown on page 
42. First, you have to define the function and supply some rough (non-zero) 
estimations for the unknown parameters. Note that the exponential operator 
is denoted by ** and the standard argument for a function definition is x, but 
this depends only on your choice: 

gnuplot> f (x)=e+a*x**b 
gnuplot> e=-1.8 
gnuplot> a=l 
gnuplot> b=-l 

The actual fit is performed via the fit command. The program uses the non- 
linear least-squares Levenberg-Marquardt algorithm [Press et al. (1995)], which 




(7.86) 



80 



A.K. Hartmann: Practical Guide to Computer Simulations 



allows a fit data to almost all arbitrary functions. To issue the command, you 
have to state the fit function, the data set and the parameters which are to be 
adjusted. First, we consider the case where just two columns of the data are 
used or available (in this case, gnuplot assumes ai = 1). For our example you 
enter: 

gnuplot> fit f(x) "sg_eO_L.dat" via e,a,b 

Then gnuplot writes log information to the output describing the fitting 
process. After the fit has converged it prints for the given example: 

After 17 iterations the fit converged. 

final sum of squares of residuals : 7.55104e-06 
rel. change during last iteration : -2.42172e-09 

degrees of freedom (ndf) : 5 

rms of residuals (stdfit) = sqrt(WSSR/ndf ) : 0.00122891 

variance of residuals (reduced chisquare) = WSSR/ndf : 1.51021e-06 

Final set of parameters Asjrmptotic Standard Error 



e = -1.78786 +/- 0.0008548 (0.04781%) 

a = 2.5425 +/- 0.2282 (8.976"/.) 

b = -2.80103 +/- 0.08265 (2.9517.) 



correlation matrix of the fit pairameters: 

e a b 

e 1.000 
a 0.708 1.000 

b -0.766 -0.991 1.000 



The most interesting lines are those where the rcsiilts 9 for your parameters 
along with the standard error bar are printed. Additionally, the quality of the 
fit can be estimated by the information provided in the three lines beginning 
with "degree of freedom". The first of these lines states the number of degrees 
of freedom, which is just n — rip. The mean-squared difference x? is denoted as 

WSSR in the gnuplot output. A measure of quality of the fit is the probability 

Q that the value of the mean-squared difference is equal or larger compared to 
the value from the current fit, given the assumption that the data points are 
distributed as in Eq. (7.85) [Press et al. (1995)]. The larger the value of Q, the 
better is the quality of the fit. As mentioned above, Q can be evaluated from a 
chi-squared distribution with n — Up degrees of freedom. To calculate Q using 
the gnuplot output you can use the little program Q . c 



These "error bars" are calculated in a way which is in fajct correct only when fitting linear 
functions; hence, they have to be taJjen with care. 
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#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 
#include <gsl/gsl_cdf .h> 

int mainCint argc, char *airgv[]) 
{ 

double WSSRndf; 
int ndf ; 

if (argc != 3) 
{ 

printf ("USAGE 7.s <ndf> <WSSR/ndf >\n" , argv[0]); 
exit(l) ; 

> 

ndf = atoi(argv[l] ) ; 

sscanf (argv[2] , '"/.If", ftWSSRndf); 

printf ("# Q=y.e\n", gsl_cdf_chisq_Q(ndf *WSSRndf , ndf)); 
return (0) ; 

} 

which uses the gsl_cdf _chisq_q() function from the GSL (see Sec. 6.3). The 
program is called in the form Q <ndf > <WSSR/ndf>, which can be taken from 
the gnuplot output. Note that in this case we obtain (5 = 1, which is so large, 
because ai = 1 was used, see below. 

To watch the result of the fit along with the original data, just enter 

gnuplot> plot "sg_eO_L.dat" w e, f(x) 

The result is displayed in Fig. 7.29. Please note that the convergence depends 
on the initial choice of the parameters. The algorithm may be trapped into a 
local minimum in case the parameters are too far away from the best values. Try 
the initial values e=l, a=-3 and b=l! Furthermore, not all function parameters 
have to be subjected to the fitting. Alternatively, you can set some parameters 
to fixed values and omit them from the via list at the end of the fit command. 
Remember that in the above example all data points enter into the result with 
the same weight, i.e. (t, = 1 Vi is assumed. You can tell the algorithm to consider 
the error bars, for example supplied in the third column, by typing 

gnuplot> fit f(x) "sg_eO_L.dat" using 1:2:3 via e,a,b 

Then, data points with larger error bars have less influence on the results. In 
this case a different result whith smaller value of Q will arise (try it !). 

Finally, you can also restrict the data points which are considered for the 
flt, which is applicable if only a subset of the sample follows the function law 
you are considering. This can be done in the same way as restricting the range 
of plotted values, for instance using 
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Figure 7.29: Gnuplot window showing the result of a fit command along with 
the input data. 



gnuplot> fit [5:12] f(x) "sg_eO_L.dat" using 1:2:3 via e,a,b 

More information on how to use the fit command, such as fitting higher- 
dimensional data, can be obtained when using the gnuplot online help via en- 
tering help fit. 
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Exercises 

(solutions: see CD enclosed with book) 

1. Sampling from discrete distribution 

Design, implement and test a function, which 
returns a random number which is distributed 
according to some discrete distribution function 
stored in an array F, as describe in Sec. 7.2.2. 
The function prototype reads as follows: 

/******************** ]rand_discre"te ( ) 



/** Returns natural random number distributed **/ 
/** according a discrete distribution given by the **/ 

/** distribution function in array 'F' **/ 

/** Uses search in array to generate number **/ 

/** PARAMETERS: (*)= return-paramter **/ 

/** n: nimber of entries in array **/ 

/** F: array with distribution fimction **/ 

/** RETURNS: **/ 

/** random number **/ 



int rand_discrete(int n, double *F) 

For simplicity, you can use the drand48() function from the standard C library 
to generate random numbers distributed according to U{0, 1). 
Furthermore, design, implement and test a function, which allocates and ini- 
tializes the array F for a Poisson distribution with parameter fj,, see Eq. (7.27) 
for the probability mass function. The function should determine automatically 
how many entries of F are needed, depending on the paramater fi. The function 
prototype reads as follows: 

init_poisson() *************+*+*/ 



/** Generates array with distribution function **/ 

/** for Poisson distribution with mean mu: **/ 

/** p(k)=mu"k*exp(-mu)/x! **/ 

/** The size of the array is automatically adjusted. **/ 

/** PARAMETERS: (*)= return-paramter **/ 

/** (*) n_p: p. to number of entries in table **/ 

/** mu: parameter of distribution **/ 

/** RETURNS: **/ 

/** pointer to array with distribution function **/ 



double *init_poisson(int *n_p, double mu) 

Hints: To determine the array sizes, you can first loop over the probabilities and 
take the first value k_0 where p(k_0) = within the precision of the numerics. 
This value of k_0 serves as array size. Alternatively, you start with some size 
and extend the array if needed by doubling its size. For testing purposes, you 
can generate many numbers, calculate the mean and compare it with jj,. Al- 
ternatively, you could record a histogram (see Chap. 3) and compare with Eq. 
(7.27). 



SDLUTIOM SOURCE CODE 

DIR: randomness 
FILE(S): poisson. c 
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2. Inversion Method for Fisher-Tippett distribution 



SOLUTION SOURCE CODE 



DIR: randomness 

FILE(S): 

f ischer_tippett . c 



Design, implement and test a function, wliicli 
returns a random number which is distributed 
according to the Fisher-Tippett distribution 
Eq. (7.43) with parameter A. Use the inversion 
method. 

The function prototype reads as follows: 

/******************** rand_f isher_'tippe't't () ***********/ 
/** Returns random number which is distributed **/ 
/** according the Fisher-Tippett distribution **/ 
/** PARAMETERS: (*)= return-paramter **/ 
/** lambda: parameter of distribution **/ 

/** RETURNS: **/ 
/** random number **/ 

double rand_fisher_tippett (double lambda) 

Remarks: For simplicity, you can use the drand48() function from the standard 
C library to generate random numbers distributed according to ?7(0, 1). To 
test your function, you can calculate the mean of the generated numbers, for 
instance, and compare it with the expectation value ~ 0.57721/A. 

Variance of data sample 

Design, implement and test a function, which 
calculates the variance s'^ of a sample of data 
points. Use directly Eq. (7.51), i.e. do not use 
an equivalent form of Eq. (7.21), since this form 
is more susceptible to rounding errors. 
The function prototype reads as follows: 



SOLUTION SOURCE CODE 



DIR: randomness 
FILE(S): variance. c 



/********************** varianceO **************+***++/ 

/** Calculates the vairiance of n data points **/ 

/** PARAMETERS: (*)= return-paramter **/ 

/** n: number of data points **/ 

/** x: airray with data **/ 

/** RETURNS: **/ 

/** variance **/ 

double variance (int n, double *x) 

Remark: The so-called corrected double-pass algorithm [Chan et al. (1983)] 
aims at further reducing the rounding error. It is based on the equation 



2 1 
s = — 



The second would be zero for exact arithmetic and accounts for rounding orros 
occurring in the second term. It becomes important in particular if the expec- 
tation value is large. Perform experiments for generating Gaussian distributed 
number with cr^ = 1 and p = 10^*, without and with the correction. 



7.6. GENERAL ESTIMATORS 85 



4. Bootstrap 

Design, implement and test a function, which 
uses bootstrapping to calculate the confidence 
interval at significance level a given in Eq. 
(7.66). 

The function prototype reads as follows: 

/***************** boo"ts"trap_ci () *********************/ 
/** Calculates a confidence interval by 'n_resample' **/ 



/** 


times resampling the given sample points 


**/ 


/** 


and each time evaluation the estimator 'f ' 


**/ 


/** 


PARAMETERS: 


(*)= return-paramter 


+*/ 


/** 


n: 


number of data points 


**/ 


/** 


x: 


array with data 


**/ 


/** 


n_resample : 


number of bootstrap iterations 


**/ 


/** 


alpha : 


confidence level 


**/ 


/** 


f : 


function (pointer) = estimator 


+*/ 


/** 


(*) low: 


(p. to) lower boundary of conf . 


int . **/ 


/** 


(*) high: 


(p. to) upper boundary of conf. 


int . **/ 


/** 


RETURNS: 




**/ 


/** 


(nothing) 


**/ 



/**********************+++*++++*++*+**♦*+**♦*+**♦**++♦*/ 
void bootstrap_ci(int n, double *x, int n_resample, 

double alpha, double (*f)(int, double *) , 

double *low, double *high) 

Hints: Use the function bootstrap_variance() as example. To get the entries 
at the positions defined via Eq. (7.66), you can sort the bootstrap sample first 
using qsortO, see Sec. 6.1. 

You can test your function by using the provided main file bootstrap_test . c, 
the auxiliary files mean.c and veiriance.c and by compiling with 
cc -o bt bootstrap_test . c bootstrap_ci.c mean.c -Im -DSOLUTION. Note 
that the macro definition -DSOLUTION makes the mainO function to call 
bootstrap_ci() instead of bootstrap.varianceO. 

5. Plotting data 

Plot the data file FTpdf . dat using xmgrace. 
The file contains a histogram pdf generated for 
the Fislior-Tippott distribution. The file format 
is 1st column: bin number, 2nd: bin midpoint, 
3rd: pdf value, 4th: error bar. Use 
the "block data" format to read the files (columns 2,3,4). Create a plot with 
inset. The main plot should show the histogram pdf with error bars and logarith- 
mically scaled y axis, the inset should show the data with linear axes. Describe 
the plot using a text label placed in the plot. Choose label sizes, line width and 
other styles suitably. Store the result as . agr file and export it to a postscript 
(eps) file. 

The result should look similar to: 



SOLUTION SOURCE CODE 

DIR: randomness 
FILE(S): 
bootstrap_ci.c 



SOLUTION SOURCE CODE 

DIR: randomness 
FILE(S): FTplot.agr 
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SOLUTION SOURCE CODE 



DIR: randomness 
FILE(S): chi2hh.c 



6. Chi-squcired test 

Design, iinplcincnt and test a function, which 
calculates the test statistics for two his- 
tograms {hi}, {hi} according Eq. (7.69). The 
function should return the p-valuc, i.e. the 
cumulative probability ( "p- value" ) that a value of or larger is obtained under 
the assumption that the two histograms were obtained by sampling from the 
same (discrete) random variable. 
The function prototype reads as follows: 



/** For chi"2 test: comparison of two histograms **/ 

/** to probabilities: Probability to **/ 

/** obtain the corresponding chi2 value or worse. ♦*/ 

/** It is assumed that the total number of data points**/ 

/*+ in the two histograms is equal ! **/ 

/** **/ 

/** Parameters: (*) = return pairameter **/ 



/** n_bins 
/** h 
/** h2 
/** 

/** Returns : 

/** p-value 



number of bins 
array of histogram values 
2nd array of histogram values 



**/ 
**/ 
**/ 
**/ 
**/ 
**/ 



double chi2_hh(int n_bins, int *h, int *h2) 



Hints: Use the functio chi2_hd() as example. Include a test, which verifies that 
the total number of counts in the two histograms agree. 

To test the function: Generate two histograms according to a binomial distri- 
bution with parameters n = par_n= 10 and p = 0.5 or p = par_p. Perform a 
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loop for different values of par_p and calculate the p-value each time using the 
gsl_cdf _chisq_C!() function of the GNU scientific library (GSL) (see Sec. 6.3). 

7. Linear correlation coefficient 

Design, implement and test a function, which 
calculates the linear correlation coefficient r to 
measure the strength of a correlation for a sam- 
ple {{x(),y()), (xi.yi), . . . , (xri-i,yn-i)}. 
The function prototype reads as follows: 

^ :|c :|e 3|c :|e :|e 3|c :|e :|e 3|c :|e :|e 3|c :|e 9): :|e :|c :|e :|e 3|c :|e :(£ ^ :(£ :(£ ICCC} i^^i^^il^^il^^il^^il^^il^^il^^il^^il^^ / 



/** Calculates the lineair correlation coefficient **/ 

/** **/ 

/** Parameters: (*) = return parameter **/ 

/** n: number of data points **/ 

/** x: first element of sample set **/ 

/** y: second element of sample set **/ 

/** **/ 

/** Returns: *+/ 

/** r **/ 



double IccCint n, double *x, double *y) 

Remark: Write a mainO function which generates a sample in the following way: 

the Xi numbers are generated from a standard Gaussian distribution A'^(0, 1) 
while each j/i number is drawn from a Gaussian distribution with expectation 
value KXi (variance 1). Study the result for diflferent values of k and n. 

8. Least-squares fitting 

Copy the program from exercise (2) to a new 
program and change it such that numbers for 
a shifted Fisher-Tippett with parameters A and 
peak position xo arc generated. The numbers 
should be stored in a histogram and a histogram 
pdf should be written to the standard output. 

• Choose the histogram parameters (range, bin range) such that the his- 
tograms match the generated data well. 

• Run the program to generate n — W'' numbers for parameters .ro = 2.0 
and A = 3.0. Pipe the histogram pdf to a file (e.g. using > ft.dat at the 
end of the call) . 

• Plot the result using gnuplot. 

• Define the pdf for the Fisher-Tippett distribution in gnuplot and fit the 
function to the data with xo and A as adjustable parameters. Choose a 
suitable range for the fit. 

• Plot the data together with the fitted function. 

• How does the result compare to the maximum-likelihood fit presented in 
Sec. 7.6.1? 

• Does the fit (in particular for A) get better if you increase the number of 
sample points to 10®? 



SOLUTION SOURCE CODE 

DIR: randomness 
FILE(S): Icc.c 



SOLUTION SOURCE CODE 

DIR: randomness 
FILE(S): fitFT.gp 
f isher_tippett2 . c 
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Hints: The shift is implemented by just adding xo to the generated random 
number. Use either the histograms from Chap. 3, or implement a "poor-mans 
histogram" via an array hist ( see also in the mainO function of the reject.c 
program partly presented in Sec. 7.2.4). 
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